1405-5546

S1405-55462013000200006

India

Vol. 17, No. 2, pp. 161-168, June 2013

Articles

Generation of Bilingual Dictionaries using Structural Properties

Ajay Dubey¹ and Vasudeva Varma²

¹ Search and Information Extraction Laboratory, International Institute of Information Technology, Hyderabad, AP, 500032, India, ajay.dubey@research.iiit.ac.in

² Search and Information Extraction Laboratory, International Institute of Information Technology, Hyderabad, AP, 500032, India, vv@iiit.ac.in

Article received on 06/12/2012
Accepted on 11/01/2013.

Abstract

Building bilingual dictionaries from Wikipedia has been studied extensively in computational linguistics. These dictionaries play a crucial role in Natural Language Processing (NLP) applications such as Cross-Lingual Information Retrieval, Machine Translation, and Named Entity Recognition. Most existing approaches build these dictionaries from the information present in Wikipedia titles, info-boxes, and categories. Interestingly, few use the structural properties of a document, such as its sections and subsections. In this work we exploit the structural properties of documents to build a bilingual English-Hindi dictionary. The main intuition behind this approach is that documents in different languages discussing the same topic are likely to have similar structural elements. Though we present our experiments only for Hindi, our approach is language independent and can be easily extended to other languages. The major contribution of our work is that the dictionary contains both translations and transliterations of words, a large share of which are Named Entities. We evaluate the dictionary using manually computed precision. Our approach generated a list of 72k tokens with a precision of 0.75.

Keywords: Bilingual dictionary, comparable corpora, structural elements.
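
The structural intuition can be illustrated with a minimal sketch. It is not the authors' method, which the abstract does not detail. It assumes that cross-language article pairs are already linked, that section headings have been extracted as lists, and that headings at the same position in outlines of equal length are translation candidates. The names `candidate_pairs`, `precision`, and `min_count`, and the sample headings, are hypothetical.

```python
from collections import Counter

def candidate_pairs(aligned_articles, min_count=2):
    """Pair section headings by position across linked article pairs.

    aligned_articles: iterable of (english_headings, hindi_headings) lists.
    A pair is kept only if it is observed at least min_count times.
    """
    counts = Counter()
    for en_headings, hi_headings in aligned_articles:
        # Assumption: only outlines of equal length are comparable.
        if len(en_headings) != len(hi_headings):
            continue
        counts.update(zip(en_headings, hi_headings))
    return {pair: n for pair, n in counts.items() if n >= min_count}

def precision(judgements):
    """Fraction of manually judged dictionary entries marked correct."""
    return sum(judgements) / len(judgements) if judgements else 0.0

# Toy input: three linked article pairs (hypothetical data).
articles = [
    (["History", "Geography"], ["इतिहास", "भूगोल"]),
    (["History", "Economy"], ["इतिहास", "अर्थव्यवस्था"]),
    (["History", "Culture", "Sports"], ["इतिहास", "संस्कृति"]),  # skipped: lengths differ
]
pairs = candidate_pairs(articles)
```

Requiring a minimum count is one simple way to suppress accidental positional matches; a manual judgement of sampled entries can then be passed to `precision` to obtain a score comparable in kind to the 0.75 reported above.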
