Automatic Readability Classification of Crowd-Sourced Data based on Linguistic and Information-Theoretic Features

Islam, Zahurul; Mehler, Alexander

Computación y Sistemas

On-line version ISSN 2007-9737
Print version ISSN 1405-5546

Comp. y Sist. vol.17 n.2 Ciudad de México Apr./Jun. 2013

Articles

Automatic Readability Classification of Crowd-Sourced Data based on Linguistic and Information-Theoretic Features

Clasificación automática de la legibilidad de datos de fuentes múltiples basada en características lingüísticas y de la teoría de información

Zahurul Islam¹ and Alexander Mehler²

¹ AG Texttechnology, Institut für Informatik, Goethe-Universität, Frankfurt, Germany mehler@em.uni-frankfurt.de

² AG Texttechnology, Institut für Informatik, Goethe-Universität, Frankfurt, Germany

Article received on 12/12/2012
Accepted on 16/02/2013.

Abstract

This paper presents a classifier of text readability based on information-theoretic features. The classifier was developed based on a linguistic approach to readability that explores lexical, syntactic and semantic features. For this evaluation we extracted a corpus of 645 articles from Wikipedia together with their quality judgments. We show that information-theoretic features perform as well as their linguistic counterparts even if we explore several linguistic levels at once.

Keywords: Text readability, Wikipedia, entropy, information transmission, evaluation of features.
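As an illustration of what an information-theoretic feature can look like, the sketch below computes the Shannon entropy of a text's empirical word distribution. It is a minimal example under stated assumptions, not the feature set used in the paper: the function name `word_entropy`, the whitespace tokenization, and the lowercasing are all hypothetical choices made here for brevity.

```python
import math
from collections import Counter


def word_entropy(text: str) -> float:
    """Shannon entropy, in bits, of the empirical word distribution of `text`.

    Assumption: tokens are lowercased, whitespace-separated strings. A real
    readability pipeline would likely use a proper tokenizer.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0  # define the entropy of an empty text as zero
    counts = Counter(tokens)
    total = len(tokens)
    # H = -sum_w p(w) * log2 p(w), with p(w) estimated by relative frequency
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the rug"
    print(f"word entropy: {word_entropy(sample):.3f} bits")
```

A value computed this way could serve as one numeric column in a feature vector passed to a classifier; higher values indicate a more evenly spread vocabulary.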

Summary

This paper presents a classifier of text readability based on information-theoretic features. The classifier was developed on the basis of a linguistic approach to readability that uses lexical, syntactic and semantic features. For this evaluation, a corpus of 645 Wikipedia articles was extracted together with their quality ratings. The features in question are shown to perform well, even when several linguistic levels are explored at once.

Keywords: Text readability, Wikipedia, entropy, information transmission, evaluation of features.

