
Computación y Sistemas

Print version ISSN 1405-5546

Comp. y Sist. vol.12 no.1 Mexico Jul./Sep. 2008

 

Doctoral thesis abstract

 

Automatic Semantic Role Labeling using Selectional Preferences with Very Large Corpora

 

Determinación Automática de Roles Semánticos usando Preferencias de Selección sobre Corpus muy Grandes

 

Graduate: Hiram Calvo
Center for Research in Computing (CIC)
National Polytechnic Institute (IPN)
Mexico City, Mexico, 07738

E-mails: hcalvo@cic.ipn.mx, hiramcalvo@gmail.com

Advisor: Dr. Alexander Gelbukh
Computing Research Center (CIC)
National Polytechnic Institute (IPN)
Mexico City, Mexico, 07738

www.gelbukh.com

 

Graduated on June 19th, 2006

 

Abstract

We present a method for recognizing semantic roles in Spanish sentences. The method is based on dependency parsing: heuristic rules infer dependency relationships between words, and word co-occurrence statistics (learned in an unsupervised manner) resolve ambiguities such as prepositional phrase attachment. If a complete parse cannot be produced, a partial structure is built with some (if not all) dependency relations identified. Evaluation shows that, in spite of its simplicity, the parser is more accurate than the existing parsers available for Spanish. Though certain grammar rules, as well as the lexical resources used, are specific to Spanish, the suggested approach is language-independent. One ambiguity that we analyze in greater depth is prepositional phrase attachment.
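
As an illustration of how co-occurrence statistics can decide a prepositional phrase attachment, the following is a minimal sketch and not the thesis implementation. The function name `attach_pp`, the normalization by head frequency, and the toy counts are all assumptions made for this example.

```python
from collections import Counter

def attach_pp(verb, noun, prep, noun2, counts):
    """Decide whether the phrase (prep, noun2) attaches to the verb or the noun.

    `counts` maps single words to their frequency and (head, prep, noun2)
    triples to their co-occurrence frequency. Missing keys count as zero.
    """
    # Normalize by head frequency so that very frequent heads
    # do not win automatically (an assumption of this sketch).
    verb_score = counts[(verb, prep, noun2)] / max(counts[verb], 1)
    noun_score = counts[(noun, prep, noun2)] / max(counts[noun], 1)
    return "verb" if verb_score >= noun_score else "noun"

# Toy counts, invented for illustration only.
counts = Counter({
    "comer": 100,
    "pizza": 50,
    ("comer", "con", "tenedor"): 20,
    ("pizza", "con", "tenedor"): 1,
    ("comer", "con", "queso"): 2,
    ("pizza", "con", "queso"): 15,
})

# "comer pizza con tenedor": the fork goes with the eating.
print(attach_pp("comer", "pizza", "con", "tenedor", counts))
# "comer pizza con queso": the cheese goes with the pizza.
print(attach_pp("comer", "pizza", "con", "queso", counts))
```

In practice the counts would come from a large corpus or from search engine hit counts, and unseen triples would need some form of smoothing or backoff.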

The system uses an ordered set of simple heuristic rules to determine, iteratively, the relationships between words that have not yet been assigned a governor. To resolve certain cases of ambiguity we use word co-occurrence statistics collected beforehand in an unsupervised manner, either from large corpora or from the Web (through a search engine such as Google). These statistics are collected in the form of selectional preferences.
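
The iterative use of an ordered rule set can be sketched as follows. This is a simplified illustration under stated assumptions: the rule format (a pair of part-of-speech tags plus the side that governs), the tag names, and the function `parse` are hypothetical and are not taken from the thesis.

```python
def parse(tagged, rules):
    """Attach words to governors using an ordered list of rules.

    tagged: list of (word, pos) pairs.
    rules:  ordered list of (left_pos, right_pos, governor_side) triples,
            where governor_side is "left" or "right".
    Returns a dict {dependent_index: governor_index}. Words that no rule
    can attach stay out of the dict, which yields a partial structure.
    """
    heads = {}
    progress = True
    while progress:
        progress = False
        # Only words without a governor are still visible to the rules.
        free = [i for i in range(len(tagged)) if i not in heads]
        for left_pos, right_pos, governor_side in rules:  # rules in priority order
            for left, right in zip(free, free[1:]):
                if (tagged[left][1], tagged[right][1]) == (left_pos, right_pos):
                    if governor_side == "right":
                        heads[left] = right
                    else:
                        heads[right] = left
                    progress = True
                    break
            if progress:
                break  # restart so the list of unattached words is recomputed
    return heads

# Hypothetical rule set and tags, for illustration only.
rules = [
    ("DET", "NOUN", "right"),   # determiner depends on the following noun
    ("NOUN", "ADJ", "left"),    # adjective depends on the preceding noun
    ("VERB", "NOUN", "left"),   # object noun depends on the verb
    ("NOUN", "VERB", "right"),  # subject noun depends on the verb
]
sentence = [("el", "DET"), ("perro", "NOUN"), ("negro", "ADJ"),
            ("come", "VERB"), ("carne", "NOUN")]
print(parse(sentence, rules))
```

Each pass attaches one word and removes it from the list of candidates, so words that were not adjacent at first (here "perro" and "come") become adjacent once their dependents have been attached. The loop stops when no rule applies, leaving the root and any unparsed words without a governor.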

To evaluate our system, we developed a method for converting a gold standard from a constituent format to a dependency format. Additionally, each module of the system (selectional preferences acquisition and prepositional phrase attachment disambiguation) is evaluated separately to verify that it works properly. Finally, we present some applications of our system: word sense disambiguation and linguistic steganography.
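
A common way to convert a constituent tree into dependencies is to mark one child of each phrase as its head and make the heads of the other children depend on it. The sketch below illustrates that general idea only; the head table `HEAD_CHILD`, the tuple-based tree encoding, and the labels are assumptions for this example and do not reproduce the conversion method developed in the thesis.

```python
# Hypothetical head table: for each phrase label, the label of its head child.
HEAD_CHILD = {"S": "VP", "VP": "V", "NP": "N", "PP": "P"}

def to_dependencies(tree, deps):
    """Return the head word of `tree` and append (dependent, governor)
    word pairs to `deps`.

    A tree is a tuple: (label, child, child, ...) for phrases and
    (pos, word) for leaves. Words are used directly as identifiers,
    so this sketch assumes no word is repeated in the sentence.
    """
    label, children = tree[0], tree[1:]
    if isinstance(children[0], str):  # leaf node: (pos, word)
        return children[0]
    child_heads = [to_dependencies(child, deps) for child in children]
    # Pick the child whose label matches the head table; fall back to the first.
    wanted = HEAD_CHILD.get(label)
    head_index = next(
        (i for i, child in enumerate(children) if child[0] == wanted), 0
    )
    for i, word in enumerate(child_heads):
        if i != head_index:
            deps.append((word, child_heads[head_index]))
    return child_heads[head_index]

tree = ("S",
        ("NP", ("D", "el"), ("N", "perro")),
        ("VP", ("V", "come"), ("NP", ("N", "carne"))))
deps = []
root = to_dependencies(tree, deps)
print(root, deps)
```

The returned root is the word left without a governor, and `deps` holds one pair per remaining word, which is the property a dependency gold standard needs for word-by-word comparison with a parser's output.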

Keywords: dependency parsing, PP attachment disambiguation, constituent-to-dependency conversion, heuristic rules, hybrid parser, selectional preferences.

 

Summary

We present a method for recognizing the semantic roles in Spanish sentences, that is, for identifying the role played by each element of the sentence. The method is based on dependency analysis using heuristic rules to infer dependency relations between words, together with co-occurrence statistics (learned in an unsupervised manner) to resolve ambiguities such as prepositional phrase attachment. If a complete analysis cannot be produced, a partial structure is built with some (if not all) of the dependency relations identified. The evaluation shows that, despite its simplicity, the parser's accuracy is superior to that of the existing parsers for Spanish. Although certain grammar rules and the lexical resources used are specific to Spanish, the suggested approach is language-independent. One interesting ambiguity that we have chosen to analyze in greater depth is prepositional phrase attachment.

The system uses an ordered set of simple heuristic rules to determine, iteratively, the relations between words that have not yet been assigned a governor. To resolve certain cases of ambiguity we use word co-occurrence statistics. These statistics were obtained beforehand in an unsupervised manner, either from large text corpora or from the Internet (through a search engine such as Google). Together, the co-occurrence statistics make up a database of selectional preferences.

To evaluate this system, we developed a method for converting an existing gold standard from a constituent format to a dependency format. Additionally, each module of the system (selectional preferences acquisition, prepositional phrase attachment disambiguation) is evaluated separately and independently to verify that it works correctly. Finally, we present some applications of our system: word sense disambiguation and linguistic steganography.

Keywords (translated from the Spanish list): dependency parsing, prepositional phrase disambiguation, constituent-to-dependency conversion, heuristic rules, hybrid syntactic parser, selectional preferences.

 


 
