SciELO - Scientific Electronic Library Online

 
Polibits

On-line version ISSN 1870-9044

Polibits no. 49, México, Jan./Jun. 2014

 

Combining Active and Ensemble Learning for Efficient Classification of Web Documents

 

Steffen Schnitzer1, Sebastian Schmidt1, Christoph Rensing1, and Bettina Harriehausen-Mühlbauer2

 

1 Multimedia Communications Lab, Technische Universität Darmstadt, Germany (e-mail: Steffen.Schnitzer@kom.tu-darmstadt.de, Sebastian.Schmidt@kom.tu-darmstadt.de, Christoph.Rensing@kom.tu-darmstadt.de).

2 University of Applied Sciences, Darmstadt, Germany (e-mail: Bettina.Harriehausen@h-da.de).

 

Manuscript received on December 17, 2013
Accepted for publication on February 6, 2014.

 

Abstract

Classification of text remains a challenge. Most machine-learning-based approaches require many manually annotated training instances to reach reasonable accuracy. In this article, we present an approach that minimizes the human annotation effort by interactively incorporating human annotators into the training process via active learning of an ensemble learner. Because only ambiguous instances are passed to the human annotators, the effort is reduced while very good accuracy is maintained. Since the feedback is used only to train an additional classifier and not to retrain the whole ensemble, the computational complexity is kept relatively low.
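
The idea summarized in the abstract can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the article: the toy keyword-based ensemble members, the vote-agreement threshold used to decide which instances count as ambiguous, the nearest-neighbour feedback classifier, and the labels "sports" and "finance". The only parts that follow the abstract are the overall flow: the ensemble stays fixed, only ambiguous instances are sent to a (here simulated) human annotator, and the answers train a separate, additional classifier.

```python
from collections import Counter


def keyword_classifier(keywords, label, default):
    """Toy ensemble member: predicts `label` if any keyword occurs, else `default`."""
    def predict(tokens):
        return label if keywords & set(tokens) else default
    return predict


def ensemble_vote(ensemble, tokens):
    """Return (majority label, share of members that voted for it)."""
    votes = Counter(clf(tokens) for clf in ensemble)
    label, count = votes.most_common(1)[0]
    return label, count / len(ensemble)


class FeedbackClassifier:
    """Additional classifier trained only on human feedback.

    Hypothetical choice: 1-nearest-neighbour on Jaccard token overlap.
    """

    def __init__(self):
        self.examples = []

    def add(self, tokens, label):
        self.examples.append((set(tokens), label))

    def predict(self, tokens):
        query = set(tokens)
        best_label, best_score = None, 0.0
        for example, label in self.examples:
            union = query | example
            score = len(query & example) / len(union) if union else 0.0
            if score > best_score:
                best_label, best_score = label, score
        return best_label  # None if no stored example overlaps


def collect_feedback(ensemble, pool, oracle, threshold=0.75):
    """Active-learning pass: query the oracle only for ambiguous instances."""
    feedback, queries = FeedbackClassifier(), 0
    for tokens in pool:
        _, agreement = ensemble_vote(ensemble, tokens)
        if agreement < threshold:  # ensemble members disagree: ask the human
            feedback.add(tokens, oracle(tokens))
            queries += 1
    return feedback, queries


def classify(ensemble, feedback, tokens, threshold=0.75):
    """Use the fixed ensemble; defer to the feedback classifier when ambiguous."""
    label, agreement = ensemble_vote(ensemble, tokens)
    if agreement < threshold:
        learned = feedback.predict(tokens)
        if learned is not None:
            return learned
    return label


def build_demo_ensemble():
    """Three hypothetical ensemble members for the usage example."""
    return [
        keyword_classifier({"match", "goal"}, "sports", "finance"),
        keyword_classifier({"team", "league"}, "sports", "finance"),
        keyword_classifier({"stock", "market"}, "finance", "sports"),
    ]


demo_pool = [
    ["goal", "team", "win"],         # unanimous, no query needed
    ["stock", "market", "rise"],     # unanimous, no query needed
    ["transfer", "market", "team"],  # 2:1 vote, ambiguous
]
demo_ensemble = build_demo_ensemble()
# Simulated annotator: always answers "sports" for the one ambiguous document.
demo_feedback, demo_queries = collect_feedback(
    demo_ensemble, demo_pool, oracle=lambda tokens: "sports"
)
print(demo_queries)
print(classify(demo_ensemble, demo_feedback, ["transfer", "market", "league"]))
```

In this toy run only one of the three pool documents triggers a query, and the ensemble itself is never retrained; the stored feedback then overrides the ensemble's majority vote on a later, similarly ambiguous document.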

Key words: Text classification, active learning, user feedback, ensemble learning.

 


 

Acknowledgements

The work presented in this paper was partly funded by the German Federal Ministry of Education and Research (BMBF) under grant no. 01IS12054 and partly funded in the framework of Hessen Modell Projekte, financed with funds of LOEWE-State Offensive for the Development of Scientific and Economic Excellence (HA project no. 292/11-37). The responsibility for the contents of this publication lies with the authors. We thank kimeta GmbH for their essential help in building the evaluation corpus.

 

 

Note

The first two authors contributed equally to this work.
1 http://www.google.com
2 http://www.bing.com

Creative Commons License: All content of this journal, except where otherwise noted, is licensed under a Creative Commons License.