

Computación y Sistemas

Print version ISSN 1405-5546

Comp. y Sist. vol.17 n.2 México Apr./Jun. 2013




Automatic Readability Classification of Crowd-Sourced Data based on Linguistic and Information-Theoretic Features


Clasificación automática de la legibilidad de datos de fuentes múltiples basada en características lingüísticas y de la teoría de información


Zahurul Islam1 and Alexander Mehler2


1 AG Texttechnology, Institut für Informatik, Goethe-Universität, Frankfurt, Germany

2 AG Texttechnology, Institut für Informatik, Goethe-Universität, Frankfurt, Germany


Article received on 12/12/2012
Accepted on 16/02/2013.



This paper presents a classifier of text readability based on information-theoretic features. The classifier was developed based on a linguistic approach to readability that explores lexical, syntactic and semantic features. For this evaluation we extracted a corpus of 645 articles from Wikipedia together with their quality judgments. We show that information-theoretic features perform as well as their linguistic counterparts even if we explore several linguistic levels at once.

Keywords: Text readability, Wikipedia, entropy, information transmission, evaluation of features.



En este trabajo se presenta un clasificador de la legibilidad de textos basado en las características de la teoría de información. El clasificador ha sido desarrollado en base del enfoque lingüístico a la legibilidad usando las características léxicas, sintácticas y semánticas. Para esta evaluación se extrajo un corpus de 645 artículos de Wikipedia, junto con sus evaluaciones de calidad. Se demuestra que las características mencionadas tienen buen desempeño, incluso en el caso cuando se exploran varios niveles lingüísticos a la vez.

Palabras clave: Legibilidad de textos, Wikipedia, entropía, transmisión de información, evaluación de características.
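
As an illustration of what an information-theoretic feature can look like, the minimal sketch below computes the Shannon entropy of a text's word distribution. This page does not describe the paper's actual feature set, so the choice of feature, the whitespace tokenization, and the function name `word_entropy` are all assumptions made for this example, not the authors' method.

```python
import math
from collections import Counter


def word_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the empirical word distribution of `text`.

    Hypothetical illustration only: tokens are lowercased and split on
    whitespace, which is a simplifying assumption.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    # H = -sum_w p(w) * log2 p(w), with p(w) estimated by relative frequency
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = "the cat sat on the mat"
    print(f"word entropy: {word_entropy(sample):.3f} bits")
```

A value like this could serve as one numeric input to a classifier alongside lexical, syntactic, and semantic features; a text that repeats a few words has low entropy, while a text with a broad, evenly used vocabulary has high entropy.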






All the contents of this journal, except where otherwise noted, are licensed under a Creative Commons Attribution License.