Simple Semi-supervised Learning for Chinese Word Segmentation and POS Tagging

Li, Xinxin; Wang, Xuan; Waqas Anwar, Muhammad

Research Article

Xinxin Li
Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China

Xuan Wang
Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China

Muhammad Waqas Anwar
Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China

ABSTRACT

Strategies for selecting unlabeled data are important for semi-supervised learning in natural language processing tasks. Many methods have been proposed to increase the accuracy and diversity of newly labeled data, such as ensemble-based self-training, co-training, and tri-training. In this paper, we propose a simple and effective semi-supervised algorithm for Chinese word segmentation and part-of-speech tagging that selects as new labeled data only those sentences on which two different approaches, a character-based model and a word-based model, agree. Theoretical and experimental analyses verify that sentences receiving the same annotation from both models are more accurate than those produced by either model alone, and are therefore suitable as additional training data for semi-supervised learning. Experimental results on Chinese Treebank 5.0 demonstrate that our semi-supervised approach is comparable to the best reported semi-supervised approach, which employs complex feature engineering.
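
The agreement-based selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `select_agreed`, the type aliases, and the two toy taggers are hypothetical stand-ins, and the hard-coded annotations are invented for illustration only. The sketch assumes each model maps a raw sentence to a list of (word, POS tag) pairs and keeps a sentence only when both outputs match exactly in segmentation and tags.

```python
from typing import Callable, Iterable, List, Tuple

# A tagged sentence is a list of (word, pos_tag) pairs (assumed representation).
TaggedSentence = List[Tuple[str, str]]
Tagger = Callable[[str], TaggedSentence]


def select_agreed(
    sentences: Iterable[str],
    char_tagger: Tagger,
    word_tagger: Tagger,
) -> List[TaggedSentence]:
    """Keep only sentences that both taggers annotate identically."""
    agreed: List[TaggedSentence] = []
    for sentence in sentences:
        char_output = char_tagger(sentence)
        word_output = word_tagger(sentence)
        # Agreement means identical word boundaries and identical tags.
        if char_output == word_output:
            agreed.append(char_output)
    return agreed


# Hypothetical stand-ins for trained character-based and word-based models.
def toy_char_tagger(sentence: str) -> TaggedSentence:
    table = {
        "我爱北京": [("我", "PN"), ("爱", "VV"), ("北京", "NR")],
        "他喜欢读书": [("他", "PN"), ("喜欢", "VV"), ("读书", "VV")],
    }
    return table[sentence]


def toy_word_tagger(sentence: str) -> TaggedSentence:
    table = {
        "我爱北京": [("我", "PN"), ("爱", "VV"), ("北京", "NR")],
        # Differs from the character-based output in segmentation and tags.
        "他喜欢读书": [("他", "PN"), ("喜欢", "VV"), ("读", "VV"), ("书", "NN")],
    }
    return table[sentence]


unlabeled = ["我爱北京", "他喜欢读书"]
new_training_data = select_agreed(unlabeled, toy_char_tagger, toy_word_tagger)
print(new_training_data)
```

In this toy run only the first sentence is kept, because the two taggers disagree on the second. In a semi-supervised loop, the kept sentences would be added to the labeled training set before retraining.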

How to cite this article

Xinxin Li, Xuan Wang and Muhammad Waqas Anwar, 2013. Simple Semi-supervised Learning for Chinese Word Segmentation and POS Tagging. Information Technology Journal, 12: 5955-5961.

DOI: 10.3923/itj.2013.5955.5961

URL: https://scialert.net/abstract/?doi=itj.2013.5955.5961

