SciELO - Scientific Electronic Library Online

 
Polibits

On-line version ISSN 1870-9044

Polibits  n.48 México Jul./Dec. 2013

 

Non-continuous Syntactic N-grams

(Original Spanish title: N-gramas sintácticos no-continuos)

 

Grigori Sidorov

 

Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN), Av. Juan de Dios Bátiz, s/n, esq. Othón de Mendizábal, Zacatenco, 07738, México DF, México (e-mail: www.cic.ipn.mx/~sidorov).

 

Manuscript received June 12, 2013.
Manuscript accepted for publication September 23, 2013.

 

Abstract

In this paper, we present the concept of non-continuous syntactic n-grams. In our previous work we introduced the general concept of syntactic n-grams, i.e., n-grams that are constructed by following paths in a syntactic tree. Their great advantage is that they allow purely linguistic (syntactic) information to be introduced into machine learning methods. Their disadvantage is that the text must first be parsed automatically. We also demonstrated that applying syntactic n-grams to the authorship attribution task gives better results than using traditional n-grams. However, in those works we considered only continuous syntactic n-grams, i.e., the paths in the syntactic tree were not allowed to have bifurcations. In this paper, we propose removing this limitation, so that all sub-trees of length n of a syntactic tree are considered non-continuous syntactic n-grams. Note that continuous syntactic n-grams are a particular case of non-continuous syntactic n-grams. Further research should show which type of n-gram is more useful and for which NLP tasks. We also propose a formal way of writing down (representing) a non-continuous syntactic n-gram using parentheses and commas, for example, "a b [c [d, e], f]". Finally, we present examples of the construction of non-continuous syntactic n-grams for syntactic trees obtained with FreeLing and the Stanford parser.

Key words: Vector space model, n-grams, continuous syntactic n-grams, non-continuous syntactic n-grams.
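
To make the idea of "all sub-trees of length n" concrete, the following is a minimal sketch in Python, not the author's implementation. It assumes that a dependency tree is given as a plain dictionary from a head word to its list of dependents, that words are unique within the tree, and that the notation "a b [c [d, e], f]" is read as follows: a space continues a path downward, and square brackets with commas list the branches at a bifurcation. The function names (`rooted_subtrees`, `to_notation`, `syntactic_ngrams`) are hypothetical and chosen only for illustration.

```python
def rooted_subtrees(tree, node, size):
    """All connected subtrees with exactly `size` nodes whose top node is `node`.

    Each subtree is a nested tuple: (word, (child_subtree, ...)).
    """
    def forests(kids, k):
        # Ways to spend k nodes on subtrees hanging from `kids`, keeping child order.
        if k == 0:
            return [()]
        if not kids:
            return []
        first, rest = kids[0], kids[1:]
        out = forests(rest, k)          # option 1: leave `first` out entirely
        for s in range(1, k + 1):       # option 2: give `first` a subtree of s nodes
            for sub in rooted_subtrees(tree, first, s):
                for tail in forests(rest, k - s):
                    out.append((sub,) + tail)
        return out

    return [(node, f) for f in forests(tree.get(node, []), size - 1)]


def to_notation(subtree):
    """Write a subtree in the bracket-and-comma notation (assumed reading)."""
    word, kids = subtree
    if not kids:
        return word
    if len(kids) == 1:                  # no bifurcation: continue the path
        return f"{word} {to_notation(kids[0])}"
    return f"{word} [{', '.join(to_notation(k) for k in kids)}]"


def syntactic_ngrams(tree, n):
    """All syntactic n-grams (continuous and non-continuous) of size n, sorted."""
    nodes = set(tree) | {c for kids in tree.values() for c in kids}
    return sorted(
        to_notation(s) for v in nodes for s in rooted_subtrees(tree, v, n)
    )


# Toy tree matching the abstract's example: a -> b, b -> {c, f}, c -> {d, e}.
tree = {"a": ["b"], "b": ["c", "f"], "c": ["d", "e"]}

trigrams = syntactic_ngrams(tree, 3)
continuous = [g for g in trigrams if "[" not in g]   # paths without bifurcations
print(trigrams)
print(continuous)
```

On this toy tree the sketch yields six syntactic 3-grams, four of which contain no bifurcation and are therefore continuous; the remaining two exist only once bifurcations are allowed. This illustrates why the continuous n-grams are a particular case of the non-continuous ones.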

  


 

Acknowledgments

This work was carried out with the support of the government of Mexico City (project ICYT-DF PICCO10-120), the partial support of the government of Mexico (CONACYT, SNI) and of Instituto Politécnico Nacional, Mexico (projects SIP 20120418, 20131441, 20131702; COFAA), and the FP7-PEOPLE-2010-IRSES project Web Information Quality - Evaluation Initiative (WIQ-EI), European Commission project 269180. I thank A. Gelbukh for his help and fruitful suggestions.

 


All the contents of this journal, except where otherwise noted, are licensed under a Creative Commons Attribution License.