
Computación y Sistemas

Print version ISSN 1405-5546

Comp. y Sist. vol.18 n.2 México Apr./Jun. 2014 

Regular articles


Unsupervised Learning for Syntactic Disambiguation




Alexander Gelbukh


Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico.



We present a methodological framework for syntactic disambiguation of natural language texts. The method takes an existing, manually compiled grammar that is neither probabilistic nor lexicalized and turns it into a probabilistic lexicalized grammar by automatically learning a form of subcategorization frames, or selectional preferences, for all words observed in the training corpus. The dictionary of subcategorization frames or selectional preferences obtained in training can then be used for syntactic disambiguation of new, unseen texts. The learning process is unsupervised and requires no manual markup. The proposed learning algorithm can build on any existing disambiguation method, including linguistically motivated methods for filtering or weighting competing parse trees or syntactic relations, which allows linguistic knowledge to be integrated with unsupervised machine learning.

Keywords: Natural language processing, syntactic parsing, syntactic disambiguation, unsupervised machine learning.
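
The abstract does not spell out the learning algorithm, so the following is only a rough illustration of the general idea, not the author's procedure. It assumes a generic iterative re-estimation scheme: each sentence comes with several candidate parses produced by the non-probabilistic grammar, each parse is a list of (head, relation, dependent) combinations, and all parses of a sentence contain the same number of combinations. The function names, the smoothing constant, and the toy corpus are all hypothetical.

```python
from collections import defaultdict
from math import prod


def variant_scores(variants, counts, smoothing):
    """Score each candidate parse as the product of its smoothed combination counts."""
    return [prod(counts.get(c, 0.0) + smoothing for c in v) for v in variants]


def learn_preferences(corpus, iterations=10, smoothing=0.1):
    """Estimate combination counts from ambiguous parses without any manual markup.

    corpus: list of sentences; each sentence is a list of candidate parses;
    each parse is a list of (head, relation, dependent) tuples.
    """
    counts = {}  # start with no knowledge: every parse is equally likely
    for _ in range(iterations):
        expected = defaultdict(float)
        for variants in corpus:
            scores = variant_scores(variants, counts, smoothing)
            total = sum(scores)  # positive because smoothing > 0
            for variant, score in zip(variants, scores):
                for combination in variant:
                    # each parse contributes in proportion to its current probability
                    expected[combination] += score / total
        counts = dict(expected)
    return counts


def disambiguate(variants, counts, smoothing=0.1):
    """Return the index of the highest-scoring candidate parse."""
    scores = variant_scores(variants, counts, smoothing)
    return max(range(len(variants)), key=scores.__getitem__)


# Toy corpus (hypothetical): two ambiguous and two unambiguous sentences.
fork_sentence = [
    [("eat", "obj", "pizza"), ("eat", "with", "fork")],    # attach to verb
    [("eat", "obj", "pizza"), ("pizza", "with", "fork")],  # attach to noun
]
cheese_sentence = [
    [("eat", "obj", "pizza"), ("eat", "with", "cheese")],
    [("eat", "obj", "pizza"), ("pizza", "with", "cheese")],
]
corpus = [
    fork_sentence,
    [[("eat", "with", "fork")]],
    cheese_sentence,
    [[("pizza", "with", "cheese")]],
]

counts = learn_preferences(corpus)
print(disambiguate(fork_sentence, counts), disambiguate(cheese_sentence, counts))
```

In this sketch the unambiguous sentences supply the initial evidence, and each iteration shifts weight toward the parses whose combinations are better attested elsewhere in the corpus. A linguistically motivated filter or weighting, as mentioned in the abstract, could in principle be applied by multiplying its output into the score of each candidate parse.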








This work was partially supported by the Government of Mexico via the Instituto Politécnico Nacional grant SIP 20144534 and SNI, and the European Union via the European Commission project 269180 FP7-PEOPLE-2010-IRSES: Web Information Quality-Evaluation Initiative (WIQ-EI).




Creative Commons License: All contents of this journal, except where otherwise noted, are licensed under a Creative Commons Attribution License.