Corpus-based Sentence Deletion and Split Decisions for Spanish Text Simplification

Štajner, Sanja; Drndarević, Biljana; Saggion, Horacio

Computación y Sistemas

On-line version ISSN 2007-9737; Print version ISSN 1405-5546

Comp. y Sist. vol.17 n.2 Ciudad de México Apr./Jun. 2013

Articles

Sanja Štajner¹, Biljana Drndarević², and Horacio Saggion³

¹ Research Group in Computational Linguistics, University of Wolverhampton, United Kingdom, sanjastajner@wlv.ac.uk

² TALN, Department of Information and Communication Technology, Universitat Pompeu Fabra, Spain

³ TALN, Department of Information and Communication Technology, Universitat Pompeu Fabra, Spain

Article received on 13/12/2012
Accepted on 25/01/2013.

Abstract

This study addresses the automatic simplification of Spanish texts to make them more accessible to people with cognitive disabilities. A corpus of original and manually simplified news articles was analysed to identify and quantify the operations that a text simplification system should implement. The articles were then compared at the sentence and text level using automatic feature extraction and several machine learning classification algorithms. Three groups of features were used (POS frequencies, syntactic information, and text complexity measures), with the aim of identifying features that help separate original documents from their simplified equivalents. Finally, the study investigated whether these features can be used to decide which simplification operations (split, delete, and reduce) to apply at the sentence level. Automatic classification of original sentences into those to be kept and those to be eliminated outperformed the classification previously conducted on the same corpus. Kept sentences were further classified into those to be split or significantly reduced in length and those to be left largely unchanged, with an overall F-measure of up to 0.92. Both experiments were conducted and compared on two feature sets: all features, and the best subset returned by an attribute selection algorithm.

Keywords: Spanish text simplification, supervised learning, sentence classification.
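
As a reading aid for the F-measure reported in the abstract, the following is a minimal sketch of how per-class and overall F-measure can be computed for a keep/delete sentence classifier. It assumes that "overall" means the average of per-class F-measures weighted by class frequency; the abstract does not state this. The function names and the toy labels are hypothetical and are not taken from the study's corpus or results.

```python
from collections import Counter


def f_measure(gold, predicted, positive):
    """Return (precision, recall, F1) for one class label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_f_measure(gold, predicted):
    """Average per-class F1, weighted by class frequency in the gold labels.

    Assumption: this weighting is one common reading of "overall F-measure";
    the source does not specify which averaging was used.
    """
    counts = Counter(gold)
    total = len(gold)
    return sum(
        counts[label] / total * f_measure(gold, predicted, label)[2]
        for label in counts
    )


# Toy, made-up labels for six sentences (illustration only, not real data).
gold = ["keep", "keep", "keep", "delete", "delete", "keep"]
pred = ["keep", "keep", "delete", "delete", "keep", "keep"]

print(f_measure(gold, pred, "keep"))
print(weighted_f_measure(gold, pred))
```

Weighting by class frequency matters when one class dominates, as it typically would if most original sentences were kept rather than deleted; an unweighted (macro) average would give the minority class equal influence.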

Acknowledgments

The research described in this paper was partially funded by the European Commission under the Seventh Framework Programme for Research and Technological Development (FP7, 2007-2013; FIRST 287607). This publication reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein. We acknowledge partial support from the following grants: Avanza Competitiveness grant number TSI-020302-2010-84 from the Ministry of Industry, Tourism and Trade, Spain, and grant number TIN2012-38584-C06-03 and fellowship RYC-2009-04291 (Programa Ramón y Cajal 2009) from the Spanish Ministry of Economy and Competitiveness.
