Polibits

Online version ISSN 1870-9044

Polibits no. 43, México, Jan./Jun. 2011

 

External Sandhi and its Relevance to Syntactic Treebanking

 

Sudheer Kolachina (I), Dipti Misra Sharma (II), Phani Gadde (III), Meher Vijay (IV), Rajeev Sangal (V), and Akshar Bharati

 

Language Technologies Research Centre, IIIT-Hyderabad, India (e-mail: (I) sudheer.kpg08@research.iiit.ac.in, (II) dipti@mail.iiit.ac.in, (III) phani.gadde@research.iiit.ac.in, (IV) mehervijay.yeleti@research.iiit.ac.in, (V) sangal@mail.iiit.ac.in)

 

Manuscript received October 27, 2010.
Manuscript accepted for publication January 14, 2011.

 

Abstract

External sandhi is a linguistic phenomenon which refers to a set of sound changes that occur at word boundaries. These changes are similar to phonological processes such as assimilation and fusion when they apply at the level of prosody, such as in connected speech. In some languages, external sandhi formation is reflected orthographically. In such languages, it produces forms which are morphologically unanalyzable, thus posing a problem for all kinds of NLP applications. In this paper, we discuss the implications that this phenomenon has for the syntactic annotation of sentences in Telugu, an Indian language with agglutinative morphology. We describe in detail how external sandhi formation in Telugu, if not handled prior to dependency annotation, leads either to loss or to misrepresentation of syntactic information in the treebank. This phenomenon, we argue, necessitates the introduction of a sandhi splitting stage in the generic annotation pipeline currently being followed for the treebanking of Indian languages. We identify one type of external sandhi widely occurring in the previous version of the Telugu treebank (version 0.2) and manually split all its instances, leading to the development of a new version 0.5. We also conduct an experiment with a statistical parser to empirically verify the usefulness of the changes made to the treebank. Comparing the parsing accuracies obtained on versions 0.2 and 0.5 of the treebank, we observe that splitting even just one type of external sandhi leads to an increase in the overall parsing accuracies.

Key words: Syntactic treebanks, sandhi.
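
The abstract does not say how parsing accuracy was measured. As a minimal sketch, assuming the standard unlabeled and labeled attachment scores (UAS and LAS) commonly used for dependency parsers, the comparison for a single treebank version could be computed as below. The function name `attachment_scores` and the toy arcs are hypothetical and are not taken from the paper or from the Telugu treebank.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency label).
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for token-aligned gold and predicted arcs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty analysis")
    # UAS counts tokens whose head is correct, regardless of label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose head and label are both correct.
    full_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return head_hits / n, full_hits / n


# Hypothetical four-token sentence; labels are illustrative only.
gold = [(2, "k1"), (0, "main"), (2, "k2"), (2, "k7t")]
pred = [(2, "k1"), (0, "main"), (2, "k1"), (3, "k7t")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Because sandhi splitting changes the number of tokens, scores for versions 0.2 and 0.5 would each be computed against that version's own gold annotation and then compared.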

 


 
