<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>1870-9044</journal-id>
<journal-title><![CDATA[Polibits]]></journal-title>
<abbrev-journal-title><![CDATA[Polibits]]></abbrev-journal-title>
<issn>1870-9044</issn>
<publisher>
<publisher-name><![CDATA[Instituto Politécnico Nacional, Centro de Innovación y Desarrollo Tecnológico en Cómputo]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S1870-90442013000200011</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[More Effective Boilerplate Removal: the GoldMiner Algorithm]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Endrédy]]></surname>
<given-names><![CDATA[István]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Novák]]></surname>
<given-names><![CDATA[Attila]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[Pazmany Peter Catholic University, Faculty of Information Technology and Bionics]]></institution>
<addr-line><![CDATA[Budapest ]]></addr-line>
<country>Hungary</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>12</month>
<year>2013</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>12</month>
<year>2013</year>
</pub-date>
<numero>48</numero>
<fpage>79</fpage>
<lpage>83</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_arttext&amp;pid=S1870-90442013000200011&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_abstract&amp;pid=S1870-90442013000200011&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_pdf&amp;pid=S1870-90442013000200011&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[The ever-growing web is an important source for building large-scale corpora. However, dynamically generated web pages often contain a great deal of irrelevant and duplicated text, which impairs the quality of the corpus. To ensure the high quality of web-based corpora, a good boilerplate removal algorithm is needed to extract only the relevant content from web pages. In this article, we present an automatic text extraction procedure, GoldMiner, which, by enhancing a previously published boilerplate removal algorithm, minimizes the occurrence of irrelevant duplicated content in corpora and keeps the text more coherent than previous tools do. The algorithm exploits similarities in the HTML structure of pages coming from the same domain. We also present a new evaluation document set, CleanPortalEval, which can be used to demonstrate the power of boilerplate removal algorithms on web portal pages.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[Corpus building]]></kwd>
<kwd lng="en"><![CDATA[boilerplate removal]]></kwd>
<kwd lng="en"><![CDATA[the web as corpus]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[   	    <p align="center"><font face="verdana" size="4"><b>More Effective Boilerplate Removal: the GoldMiner Algorithm</b></font></p>  	    <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>  	    <p align="center"><font face="verdana" size="2"><b>Istv&aacute;n Endr&eacute;dy, Attila Nov&aacute;k</b></font></p>  	    <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>  	    <p align="justify"><font face="verdana" size="2"><i>The authors are with the MTA&#45;PPKE Language Technology Research Group and Pazmany Peter Catholic University, Faculty of Information Technology and Bionics, 50/a Prater street, 1083 Budapest, Hungary</i> (e&#45;mail: <a href="mailto:endredy.istvan.gergely@itk.ppke.hu">endredy.istvan.gergely@itk.ppke.hu</a>, <a href="mailto:novak.attila@itk.ppke.hu">novak.attila@itk.ppke.hu</a>).</font></p>  	    <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>  	    <p align="justify"><font face="verdana" size="2">Manuscript received on July 31, 2013;    <br> 	  accepted for publication on September 30, 2013. </font></p>  	    <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>  	    ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2"><b>Abstract</b></font></p>
<p align="justify"><font face="verdana" size="2">The ever&#45;growing web is an important source for building large&#45;scale corpora. However, dynamically generated web pages often contain a great deal of irrelevant and duplicated text, which impairs the quality of the corpus. To ensure the high quality of web&#45;based corpora, a good boilerplate removal algorithm is needed to extract only the relevant content from web pages. In this article, we present an automatic text extraction procedure, GoldMiner, which, by enhancing a previously published boilerplate removal algorithm, minimizes the occurrence of irrelevant duplicated content in corpora and keeps the text more coherent than previous tools do. The algorithm exploits similarities in the HTML structure of pages coming from the same domain. We also present a new evaluation document set, CleanPortalEval, which can be used to demonstrate the power of boilerplate removal algorithms on web portal pages.</font></p>
<p align="justify"><font face="verdana" size="2"><b>Key words:</b> Corpus building, boilerplate removal, the web as corpus.</font></p>
<p align="justify"><font face="verdana" size="2">&nbsp;</font></p>
<p align="justify"><font face="verdana" size="2"><a href="/pdf/poli/n48/n48a11.pdf" target="_blank">DOWNLOAD ARTICLE IN PDF FORMAT</a></font></p>
<p align="justify"><font face="verdana" size="2">&nbsp;</font></p>
<p align="justify"><font face="verdana" size="2"><b>Acknowledgments</b></font></p>
<p align="justify"><font face="verdana" size="2">This research was partially supported by the project grants T&Aacute;MOP&#45;4.2.1./B&#45;11/2&#45;KMR&#45;2011&#45;0002 and T&Aacute;MOP&#45;4.2.2./B&#45;10/1&#45;2010&#45;0014.</font></p>
<p align="justify"><font face="verdana" size="2">&nbsp;</font></p>
<p align="justify"><font face="verdana" size="2"><b>References</b></font></p>  	    ]]></body>
<body><![CDATA[<!-- ref --><p align="justify"><font face="verdana" size="2">&#91;1&#93; A. Finn, N. Kushmerick, and B. Smyth, "Fact or fiction: Content classification for digital libraries," in <i>DELOS Workshop: Personalisation and Recommender Systems in Digital Libraries,</i> 2001.<!-- end-ref --></font></p>
<!-- ref --><p align="justify"><font face="verdana" size="2">&#91;2&#93; C. Kohlsch&uuml;tter, P. Fankhauser, and W. Nejdl, "Boilerplate detection using shallow text features," in <i>Proceedings of the third ACM international conference on Web search and data mining,</i> ser. WSDM '10. New York, NY, USA: ACM, 2010, pp. 441&#45;450. &#91;Online&#93;. Available: <a href="http://doi.acm.org/10.1145/1718487.1718542" target="_blank">http://doi.acm.org/10.1145/1718487.1718542</a><!-- end-ref --></font></p>
<!-- ref --><p align="justify"><font face="verdana" size="2">&#91;3&#93; J. Pomik&aacute;lek, "Removing boilerplate and duplicate content from web corpora &#91;online&#93;," Ph.D. dissertation, Masarykova univerzita, Fakulta informatiky, 2011.<!-- end-ref --></font></p>
<!-- ref --><p align="justify"><font face="verdana" size="2">&#91;4&#93; M. Baroni, F. Chantree, A. Kilgarriff, and S. Sharoff, "Cleaneval: A competition for cleaning web pages," in <i>Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08),</i> B. M. Nicoletta Calzolari, Khalid Choukri and D. Tapias, Eds. Marrakech, Morocco: European Language Resources Association (ELRA), 2008.<!-- end-ref --></font></p>
<!-- ref --><p align="justify"><font face="verdana" size="2">&#91;5&#93; S. Evert, "A lightweight and efficient tool for cleaning web pages," in <i>Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08),</i> B. M. Nicoletta Calzolari, Khalid Choukri and D. Tapias, Eds. Marrakech, Morocco: European Language Resources Association (ELRA), 2008.<!-- end-ref --></font></p>
<!-- ref --><p align="justify"><font face="verdana" size="2">&#91;6&#93; V. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," <i>Cybernetics and Control Theory,</i> vol. 10, no. 8, pp. 707&#45;710, 1966, original in <i>Doklady Akademii Nauk SSSR</i> 163(4): 845&#45;848 (1965).<!-- end-ref --></font></p>      ]]></body><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Finn]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Kushmerick]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[Smyth]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Fact or fiction: Content classification for digital libraries]]></article-title>
<source><![CDATA[DELOS Workshop: Personalisation and Recommender Systems in Digital Libraries]]></source>
<year>2001</year>
</nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kohlschütter]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Fankhauser]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Nejdl]]></surname>
<given-names><![CDATA[W.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Boilerplate detection using shallow text features]]></article-title>
<source><![CDATA[Proceedings of the third ACM international conference on Web search and data mining, ser. WSDM '10]]></source>
<year>2010</year>
<page-range>441-450</page-range><publisher-loc><![CDATA[New York, NY, USA]]></publisher-loc>
<publisher-name><![CDATA[ACM]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Pomikálek]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<source><![CDATA[Removing boilerplate and duplicate content from web corpora]]></source>
<year>2011</year>
</nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Baroni]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Chantree]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
<name>
<surname><![CDATA[Kilgarriff]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Sharoff]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Cleaneval: A competition for cleaning web pages]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Nicoletta Calzolari]]></surname>
<given-names><![CDATA[B. M.]]></given-names>
</name>
<name>
<surname><![CDATA[Choukri]]></surname>
<given-names><![CDATA[Khalid]]></given-names>
</name>
<name>
<surname><![CDATA[Tapias]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<source><![CDATA[Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)]]></source>
<year>2008</year>
<publisher-loc><![CDATA[Marrakech ]]></publisher-loc>
<publisher-name><![CDATA[European Language Resources Association (ELRA)]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Evert]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[A lightweight and efficient tool for cleaning web pages]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Nicoletta Calzolari]]></surname>
<given-names><![CDATA[B. M.]]></given-names>
</name>
<name>
<surname><![CDATA[Choukri]]></surname>
<given-names><![CDATA[Khalid]]></given-names>
</name>
<name>
<surname><![CDATA[Tapias]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<source><![CDATA[Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)]]></source>
<year>2008</year>
<publisher-loc><![CDATA[Marrakech ]]></publisher-loc>
<publisher-name><![CDATA[European Language Resources Association (ELRA)]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Levenshtein]]></surname>
<given-names><![CDATA[V.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Binary codes capable of correcting deletions, insertions, and reversals]]></article-title>
<source><![CDATA[Cybernetics and Control Theory]]></source>
<year>1966</year>
<volume>10</volume>
<numero>8</numero>
<issue>8</issue>
<page-range>707-710</page-range></nlm-citation>
</ref>
</ref-list>
</back>
</article>
