Polibits
On-line version ISSN 1870-9044
Polibits no. 48, México, Jul./Dec. 2013
More Effective Boilerplate Removal: the GoldMiner Algorithm
István Endrédy, Attila Novák
The authors are with the MTA-PPKE Language Technology Research Group and Pazmany Peter Catholic University, Faculty of Information Technology and Bionics, 50/a Prater street, 1083 Budapest, Hungary (e-mail: endredy.istvan.gergely@itk.ppke.hu, novak.attila@itk.ppke.hu).
Manuscript received on July 31, 2013;
accepted for publication on September 30, 2013.
Abstract
The ever-growing web is an important source for building large-scale corpora. However, dynamically generated web pages often contain a great deal of irrelevant and duplicated text, which impairs the quality of the corpus. To ensure the high quality of web-based corpora, a good boilerplate removal algorithm is needed to extract only the relevant content from web pages. In this article, we present GoldMiner, an automatic text extraction procedure that enhances a previously published boilerplate removal algorithm. It minimizes the occurrence of irrelevant duplicated content in corpora and keeps the text more coherent than previous tools do. The algorithm exploits similarities in the HTML structure of pages coming from the same domain. We also present a new evaluation document set, CleanPortalEval, which can demonstrate the effectiveness of boilerplate removal algorithms on web portal pages.
Key words: Corpus building, boilerplate removal, the web as corpus.
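The idea of exploiting what pages from the same domain have in common can be illustrated with a minimal sketch. It is not the GoldMiner algorithm itself, whose details are given in the full article; it is a simplified stand-in that treats any text block repeated on most pages of one domain as boilerplate. The names `BlockCollector`, `text_blocks`, `strip_repeated_blocks`, and the `min_share` threshold are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collects non-empty text chunks from an HTML page (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not self._skip:
            self.blocks.append(text)


def text_blocks(html):
    """Return the list of text chunks found in one HTML document."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


def strip_repeated_blocks(pages, min_share=0.5):
    """Drop blocks that occur on more than min_share of the given pages.

    Assumption: all pages come from the same domain, so a block that
    repeats across them (menu, footer) is likely boilerplate.
    """
    per_page = [text_blocks(page) for page in pages]
    counts = Counter()
    for blocks in per_page:
        counts.update(set(blocks))  # count each block once per page
    limit = min_share * len(pages)
    return [[b for b in blocks if counts[b] <= limit] for blocks in per_page]
```

Given three pages that share a menu and a footer but differ in their article text, `strip_repeated_blocks` keeps only the article text of each page. A real system would also have to handle near-duplicates and blocks that vary slightly from page to page, which this sketch ignores.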
Acknowledgments
This research was partially supported by the project grants TÁMOP-4.2.1./B-11/2-KMR-2011-0002 and TÁMOP-4.2.2./B-10/1-2010-0014.