Content Extraction based on Hierarchical Relations in DOM Structures

López, Sergio; Silva, Josep; Insa, David

Polibits

On-line version ISSN 1870-9044

Polibits no. 45, México, Jun. 2012

Sergio López*, Josep Silva**, and David Insa***

The authors are with the Departamento de Sistemas Informáticos y Computación, Universitat Politécnica de Valencia, E-46022 Valencia, Spain (e-mail: *slopez@dsic.upv.es, **jsilva@dsic.upv.es, ***dinsa@dsic.upv.es).

Manuscript received on October 21, 2011; accepted for publication on December 9, 2011.

Abstract

This article introduces a new approach to content extraction that exploits the hierarchical interrelations of the elements in a webpage. Content extraction is a technique for extracting the main textual content from a webpage. It is useful for filtering out advertisements and any additional information that is not part of the main content. The main idea behind our approach is to use the DOM tree as an explicit representation of the interrelations of the elements in a webpage. Using the information contained in the DOM tree, we can identify blocks of content and easily determine which of the blocks contains the most text. Thanks to this information, the technique achieves considerable recall and precision. Using the DOM structure for content extraction gives us the benefits of other approaches based on the syntax of the webpage (such as characters, words, and tags), but it also gives us very precise information about the related components in a block, thus producing very cohesive blocks.
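The general idea of locating the block that holds most of the text can be illustrated with a small sketch. This is not the algorithm of the article, whose details are not given in the abstract; it is a simplified heuristic written only for illustration. It builds a DOM-like tree with Python's standard `html.parser` module and descends from the root into the child holding most of the text until no single child dominates. The names `Node`, `TreeBuilder`, `parse`, `main_block`, the `threshold` parameter, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class Node:
    """A minimal DOM-like element (illustrative, not the W3C DOM API)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.content = []  # strings and child Nodes, in document order

    @property
    def children(self):
        return [c for c in self.content if isinstance(c, Node)]

    def text_length(self):
        """Number of text characters in this subtree."""
        if self.tag in SKIP_TAGS:
            return 0
        return sum(len(c) if isinstance(c, str) else c.text_length()
                   for c in self.content)

    def all_text(self):
        """Text of this subtree, in document order."""
        if self.tag in SKIP_TAGS:
            return ""
        parts = [c if isinstance(c, str) else c.all_text() for c in self.content]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML, tolerating unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.content.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text:
            self.current.content.append(text)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def main_block(root, threshold=0.6):
    """Descend into the child with the most text while it holds at least
    `threshold` of its parent's text (an assumed stopping rule)."""
    node = root
    while True:
        kids = node.children
        if not kids:
            return node
        best = max(kids, key=Node.text_length)
        if best.text_length() == 0 or best.text_length() < threshold * node.text_length():
            return node
        node = best


PAGE = """
<html><body>
  <div id="menu"><a href="/">Home</a> <a href="/news">News</a></div>
  <div id="main">
    <p>The DOM tree makes the hierarchy of a page explicit.</p>
    <p>Blocks of related elements share a common ancestor.</p>
    <p>The block holding most of the text is the main content.</p>
  </div>
  <div id="ads"><a href="/buy">Buy now</a></div>
</body></html>
"""

block = main_block(parse(PAGE))
print(block.tag, block.attrs.get("id"))
print(block.all_text())
```

On this sample page the descent stops at the `div` that groups the three paragraphs, because no single paragraph holds most of that block's text, so the menu and the advertisement are left out. A fixed threshold is a crude stopping rule: a page whose article consists of one very long paragraph would make the sketch descend into that paragraph alone.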

Key words: Content Extraction, Block Detection, DOM.

ACKNOWLEDGMENTS

This work was partially supported by the Spanish Ministerio de Ciencia e Innovación under the grant TIN2008-06622-C03-02 and by the Generalitat Valenciana under the grant PROMETEO/2011/052. David Insa was partially supported by the Ministerio de Educación under grant FPU AP2010-4415.
