Xinxin Li
Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China
Xuan Wang
Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China
Muhammad Waqas Anwar
Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China
ABSTRACT
Strategies for selecting unlabeled data are important for semi-supervised learning in natural language processing tasks. Many methods have been proposed to increase the accuracy and diversity of newly labeled data, such as ensemble-based self-training, co-training and tri-training. In this paper, we propose a simple and effective semi-supervised algorithm for Chinese word segmentation and part-of-speech tagging that selects newly labeled data on which two different approaches agree: a character-based model and a word-based model. Theoretical and experimental analysis verifies that sentences receiving the same annotation from both models are more accurate than those generated by either model alone and are suitable as additional training data for semi-supervised learning. Experimental results on Chinese Treebank 5.0 demonstrate that our semi-supervised approach is comparable to the best reported semi-supervised approach, which employs complex feature engineering.
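The agreement-based selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name select_agreed, the type aliases, and the two toy taggers are hypothetical stand-ins for the trained character-based and word-based models, and a sentence is assumed to be kept only when both outputs match exactly in segmentation and tags.

```python
from typing import Callable, Iterable, List, Tuple

# A tagged sentence is a list of (word, POS tag) pairs.
Tagged = List[Tuple[str, str]]
# A tagger maps a raw sentence to its segmented and tagged form.
Tagger = Callable[[str], Tagged]


def select_agreed(sentences: Iterable[str],
                  char_tagger: Tagger,
                  word_tagger: Tagger) -> List[Tagged]:
    """Keep only sentences on which both taggers give identical output."""
    agreed = []
    for sentence in sentences:
        char_out = char_tagger(sentence)
        word_out = word_tagger(sentence)
        # Exact match covers both word boundaries and POS tags.
        if char_out == word_out:
            agreed.append(char_out)
    return agreed


# Toy stand-ins for the two trained models (hypothetical, illustration only).
def toy_char_tagger(sentence: str) -> Tagged:
    # Treats every character as a one-character word.
    return [(ch, "X") for ch in sentence]


def toy_word_tagger(sentence: str) -> Tagged:
    # Disagrees with the character tagger on sentences longer than two characters.
    if len(sentence) > 2:
        return [(sentence, "X")]
    return [(ch, "X") for ch in sentence]


selected = select_agreed(["ab", "abc"], toy_char_tagger, toy_word_tagger)
```

In a semi-supervised loop, the selected sentences would be added to the labeled training set before retraining; how many rounds to run and how to weight the new data are left open here.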
How to cite this article
Xinxin Li, Xuan Wang and Muhammad Waqas Anwar, 2013. Simple Semi-supervised Learning for Chinese Word Segmentation and Pos Tagging. Information Technology Journal, 12: 5955-5961.
DOI: 10.3923/itj.2013.5955.5961
URL: https://scialert.net/abstract/?doi=itj.2013.5955.5961