ISSN 1870-9044

Special section: Information Retrieval and Natural Language Processing

Improving Named Entity Extraction Accuracy using Unlabeled Data and Several Extractors

Tomoya Iwakura and Seishi Okamoto

Fujitsu Laboratories Ltd., 1-1, Kamikodanaka 4-chome, Nakahara-ku, Kawasaki 211-8588, Japan (iwakura.tomoya@jp.fujitsu.com, seishi@jp.fujitsu.com).

Manuscript received November 4, 2008.
Manuscript accepted for publication August 25, 2009.

Abstract

This paper proposes feature augmentation methods that use unlabeled data and several Named Entity (NE) extractors. We run NE extractors over unlabeled data to collect NE-related information about each word, which we call NE-related labels. These labels include the candidate NE class labels of each word and the NE class labels of co-occurring words. To collect the NE-related labels accurately, we consider methods that combine the outputs of several NE extractors. We then use the NE-related labels as additional features for training new NE extractors. We apply the resulting NE extraction methods to the IREX Japanese NE extraction task. The experimental results show better accuracy than previously reported results obtained with NE extractors that use handcrafted resources.

Key words: Named entity recognition, unlabeled data, combination of extractors.
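
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the extractor outputs are combined or how co-occurrence is defined, so the sketch makes its own assumptions: a label is kept only when at least `min_agree` extractors agree on it, and co-occurring words are the other words in the same sentence. The function names (`collect_ne_labels`, `augment_features`, `make_dict_extractor`) and the toy dictionary-based extractors are hypothetical and are not the authors' implementation.

```python
from collections import Counter, defaultdict


def make_dict_extractor(lexicon):
    """Toy stand-in for a trained NE extractor: looks each word up in a lexicon."""
    return lambda words: [lexicon.get(w, "O") for w in words]


def collect_ne_labels(sentences, extractors, min_agree=2):
    """Collect NE-related labels from unlabeled sentences.

    Assumption: a label is trusted only if at least `min_agree` extractors
    assign it to the word; otherwise the word is treated as "O" (no NE).
    """
    candidate = defaultdict(Counter)  # word -> counts of its own NE class labels
    cooccur = defaultdict(Counter)    # word -> counts of NE class labels of co-occurring words
    for words in sentences:
        outputs = [extract(words) for extract in extractors]
        agreed = []
        for labels in zip(*outputs):
            label, votes = Counter(labels).most_common(1)[0]
            agreed.append(label if votes >= min_agree else "O")
        for i, (word, label) in enumerate(zip(words, agreed)):
            if label != "O":
                candidate[word][label] += 1
            # Assumption: "co-occurring" means any other word in the same sentence.
            for j, other in enumerate(agreed):
                if j != i and other != "O":
                    cooccur[word][other] += 1
    return candidate, cooccur


def augment_features(words, base_features, candidate, cooccur, top_k=1):
    """Append the most frequent NE-related labels of each word to its base features."""
    augmented = []
    for word, feats in zip(words, base_features):
        extra = [f"cand={lab}" for lab, _ in candidate.get(word, Counter()).most_common(top_k)]
        extra += [f"cooc={lab}" for lab, _ in cooccur.get(word, Counter()).most_common(top_k)]
        augmented.append(list(feats) + extra)
    return augmented
```

In this sketch the augmented feature lists would be passed to whatever learner trains the new extractor. Raising `min_agree` trades coverage of the unlabeled data for more reliable labels.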

