<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>1405-5546</journal-id>
<journal-title><![CDATA[Computación y Sistemas]]></journal-title>
<abbrev-journal-title><![CDATA[Comp. y Sist.]]></abbrev-journal-title>
<issn>1405-5546</issn>
<publisher>
<publisher-name><![CDATA[Instituto Politécnico Nacional, Centro de Investigación en Computación]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S1405-55462018000401249</article-id>
<article-id pub-id-type="doi">10.13053/cys-22-4-3062</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[A New Proposal for Evaluating Web Page Cleaning Tools]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Lejeune]]></surname>
<given-names><![CDATA[Gaël]]></given-names>
</name>
<xref ref-type="aff" rid="Af1"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Zhu]]></surname>
<given-names><![CDATA[Lichao]]></given-names>
</name>
<xref ref-type="aff" rid="Af2"/>
</contrib>
</contrib-group>
<aff id="Af1">
<institution><![CDATA[Sorbonne University]]></institution>
<addr-line><![CDATA[Paris]]></addr-line>
<country>France</country>
</aff>
<aff id="Af2">
<institution><![CDATA[Paris XIII University]]></institution>
<addr-line><![CDATA[Villetaneuse]]></addr-line>
<country>France</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>12</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>12</month>
<year>2018</year>
</pub-date>
<volume>22</volume>
<numero>4</numero>
<fpage>1249</fpage>
<lpage>1258</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_arttext&amp;pid=S1405-55462018000401249&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_abstract&amp;pid=S1405-55462018000401249&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_pdf&amp;pid=S1405-55462018000401249&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[In this article, we tackle the problem of evaluating Web Content Extraction tools. This task is seldom studied in the literature, although it has important consequences for the linguistic processing of web-based corpora. Here, we compare two types of evaluation: first, an intrinsic (content-based) evaluation carried out in a multilingual setting (five languages); second, an extrinsic (task-based) evaluation on the same corpus, which studies the effect of the cleaning step on the performance of an NLP pipeline. We show that the results of the intrinsic evaluation are not consistent with those of the extrinsic evaluation. We also show that the results differ greatly across the studied languages. We conclude that a web page cleaning tool should be chosen with respect to the task at hand rather than the performance observed through the intrinsic evaluation scheme.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[Corpus]]></kwd>
<kwd lng="en"><![CDATA[multilingual corpora]]></kwd>
<kwd lng="en"><![CDATA[Web content extraction]]></kwd>
<kwd lng="en"><![CDATA[Web page cleaning]]></kwd>
<kwd lng="en"><![CDATA[evaluation]]></kwd>
<kwd lng="en"><![CDATA[classification]]></kwd>
</kwd-group>
</article-meta>
</front><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Baluja]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
</person-group>
<source><![CDATA[Browsing on small screens: recasting web-page segmentation into an efficient machine learning framework]]></source>
<year>2006</year>
<conf-name><![CDATA[ 15th international conference on World Wide Web, WWW &#8217;06]]></conf-name>
<conf-loc>New York, NY, USA </conf-loc>
<page-range>33-42</page-range></nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Barbaresi]]></surname>
<given-names><![CDATA[A]]></given-names>
</name>
</person-group>
<source><![CDATA[Ad hoc and general-purpose corpus construction from web sources]]></source>
<year>2015</year>
<publisher-loc><![CDATA[France ]]></publisher-loc>
<publisher-name><![CDATA[École normale supérieure de Lyon]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Biemann]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Bildhauer]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
<name>
<surname><![CDATA[Evert]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Goldhahn]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Quasthoff]]></surname>
<given-names><![CDATA[U.]]></given-names>
</name>
<name>
<surname><![CDATA[Schäfer]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Simon]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Swiezinski]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Zesch]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Scalable construction of high-quality web corpora]]></article-title>
<source><![CDATA[JLCL]]></source>
<year>2013</year>
<volume>28</volume>
<numero>2</numero>
<issue>2</issue>
<page-range>23-59</page-range></nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Brixtel]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Lejeune]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Doucet]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Lucas]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
</person-group>
<source><![CDATA[Any Language Early Detection of Epidemic Diseases from Web News Streams]]></source>
<year>2013</year>
<conf-name><![CDATA[ International Conference on Healthcare Informatics (ICHI)]]></conf-name>
<conf-loc> </conf-loc>
<page-range>159-68</page-range></nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Chakrabarti]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Kumar]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Punera]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
</person-group>
<source><![CDATA[A graph-theoretic approach to webpage segmentation]]></source>
<year>2008</year>
<conf-name><![CDATA[ 17th international conference on World Wide Web, WWW &#8217;08]]></conf-name>
<conf-loc>New York, NY, USA </conf-loc>
<page-range>377-86</page-range></nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Das]]></surname>
<given-names><![CDATA[S. N.]]></given-names>
</name>
<name>
<surname><![CDATA[Vijayaraghavan]]></surname>
<given-names><![CDATA[P. K.]]></given-names>
</name>
<name>
<surname><![CDATA[Mathew]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Eliminating noisy information in web pages using featured DOM tree]]></article-title>
<source><![CDATA[International Journal of Applied Information Systems]]></source>
<year>2012</year>
<volume>2</volume>
<numero>2</numero>
<issue>2</issue>
<page-range>27-34</page-range><publisher-loc><![CDATA[New York, USA ]]></publisher-loc>
<publisher-name><![CDATA[Foundation of Computer Science]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Doucet]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Kazai]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Meunier]]></surname>
<given-names><![CDATA[J.-L.]]></given-names>
</name>
</person-group>
<source><![CDATA[ICDAR 2011 Book Structure Extraction Competition]]></source>
<year>2011</year>
<conf-name><![CDATA[ Eleventh International Conference on Document Analysis and Recognition (ICDAR&#8217;2011)]]></conf-name>
<conf-loc>Beijing, China </conf-loc>
<page-range>1501-5</page-range></nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Evert]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
</person-group>
<source><![CDATA[A lightweight and efficient tool for cleaning web pages]]></source>
<year>2008</year>
<conf-name><![CDATA[ LREC 2008]]></conf-name>
<conf-loc> </conf-loc>
<page-range>3489-93</page-range></nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ferraresi]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Zanchetta]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
<name>
<surname><![CDATA[Baroni]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Bernardini]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<source><![CDATA[Introducing and evaluating ukWaC, a very large web-derived corpus of English]]></source>
<year>2008</year>
<conf-name><![CDATA[ 4th Web as Corpus Workshop, LREC 2008]]></conf-name>
<conf-loc> </conf-loc>
<page-range>47-54</page-range></nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Fu]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Meng]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Xia]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Yu]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
</person-group>
<source><![CDATA[Web content extraction based on webpage layout analysis]]></source>
<year>2010</year>
<conf-name><![CDATA[ 2010 Second International Conference on Information Technology and Computer Science]]></conf-name>
<conf-loc> </conf-loc>
<page-range>40-3</page-range></nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kohlschütter]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Fankhauser]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Nejdl]]></surname>
<given-names><![CDATA[W.]]></given-names>
</name>
</person-group>
<source><![CDATA[Boilerplate detection using shallow text features]]></source>
<year>2010</year>
<conf-name><![CDATA[ third ACM international conference on Web search and data mining, WSDM &#8217;10]]></conf-name>
<conf-loc>New York, NY, USA </conf-loc>
<page-range>441-50</page-range></nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Pasternack]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Roth]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<source><![CDATA[Extracting article text from the web with maximum subsequence segmentation]]></source>
<year>2009</year>
<page-range>971-80</page-range></nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Pomikálek]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
</person-group>
<source><![CDATA[Removing boilerplate and duplicate content from web corpora]]></source>
<year>2011</year>
<conf-name><![CDATA[ Diserta&#269;ní práce, Masarykova univerzita, Fakulta informatiky]]></conf-name>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ratcliff]]></surname>
<given-names><![CDATA[J. W.]]></given-names>
</name>
<name>
<surname><![CDATA[Metzener]]></surname>
<given-names><![CDATA[D. E.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Pattern matching: The gestalt approach]]></article-title>
<source><![CDATA[Dr. Dobb's Journal]]></source>
<year>1988</year>
<volume>13</volume>
<numero>7</numero>
<issue>7</issue>
<page-range>46-68-72</page-range></nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Spousta]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Marek]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Pecina]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
</person-group>
<source><![CDATA[Victor: the Web-Page Cleaning Tool]]></source>
<year>2008</year>
<conf-name><![CDATA[ 4th Web as Corpus Workshop, LREC 2008]]></conf-name>
<conf-loc> </conf-loc>
<page-range>12-7</page-range></nlm-citation>
</ref>
<ref id="B16">
<label>16</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Vieira]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
<name>
<surname><![CDATA[da Silva]]></surname>
<given-names><![CDATA[A. S.]]></given-names>
</name>
<name>
<surname><![CDATA[Pinto]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[de Moura]]></surname>
<given-names><![CDATA[E. S.]]></given-names>
</name>
<name>
<surname><![CDATA[Cavalcanti]]></surname>
<given-names><![CDATA[J. M. B.]]></given-names>
</name>
<name>
<surname><![CDATA[Freire]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<source><![CDATA[A fast and robust method for web page template detection and removal]]></source>
<year>2006</year>
<conf-name><![CDATA[ ACM international conference on Information and knowledge management, CIKM &#8217;06]]></conf-name>
<conf-loc>New York, NY, USA </conf-loc>
<page-range>258-67</page-range></nlm-citation>
</ref>
</ref-list>
</back>
</article>
