SciELO - Scientific Electronic Library Online

 

Computación y Sistemas

Online version ISSN 2007-9737; print version ISSN 1405-5546

Abstract

LEJEUNE, Gaël and ZHU, Lichao. A New Proposal for Evaluating Web Page Cleaning Tools. Comp. y Sist. [online]. 2018, vol.22, n.4, pp.1249-1258. Epub 10-Feb-2021. ISSN 2007-9737. https://doi.org/10.13053/cys-22-4-3062.

In this article, we address the problem of evaluating Web Content Extraction tools. This task is seldom studied in the literature, although it has important consequences for the linguistic processing of web-based corpora. We compare two types of evaluation: first, an intrinsic (content-based) evaluation carried out in a multilingual setting (five languages); second, an extrinsic (task-based) evaluation on the same corpus, which studies the effect of the cleaning step on the performance of an NLP pipeline. We show that the results of the intrinsic evaluation are not consistent with those of the extrinsic evaluation, and that the results differ greatly across the languages studied. We conclude that a web page cleaning tool should be chosen with respect to the task at hand rather than on the basis of its performance under the intrinsic evaluation scheme.

Keywords: Corpus; multilingual corpora; Web content extraction; Web page cleaning; evaluation; classification.
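
To make the idea of an intrinsic (content-based) evaluation concrete, here is a minimal sketch of one common way to score a cleaning tool's output against a manually cleaned reference text. It is an illustration only, not the metric used in the article: the function name `token_prf`, the whitespace tokenization, and the bag-of-tokens overlap are all assumptions made for this example. Whitespace tokenization in particular would not suit languages written without spaces.

```python
from collections import Counter
from typing import Tuple


def token_prf(extracted: str, reference: str) -> Tuple[float, float, float]:
    """Token-level precision, recall and F1 of extracted text vs. a reference.

    Hypothetical helper for illustration; tokens are whitespace-separated
    and compared as multisets (word order is ignored).
    """
    ext = Counter(extracted.split())
    ref = Counter(reference.split())

    # Multiset intersection: each token counts at most as often as it
    # appears in both texts.
    overlap = sum((ext & ref).values())

    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Tool output still contains two navigation tokens and misses part of the text.
p, r, f = token_prf("Home Menu the cat sat", "the cat sat on the mat")
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

An extrinsic evaluation, by contrast, would feed each tool's output into the downstream NLP pipeline and compare task scores, which is why the two schemes can rank tools differently.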
