<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>1405-5546</journal-id>
<journal-title><![CDATA[Computación y Sistemas]]></journal-title>
<abbrev-journal-title><![CDATA[Comp. y Sist.]]></abbrev-journal-title>
<issn>1405-5546</issn>
<publisher>
<publisher-name><![CDATA[Instituto Politécnico Nacional, Centro de Investigación en Computación]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S1405-55462018000200483</article-id>
<article-id pub-id-type="doi">10.13053/cys-22-2-2959</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Improving the Boilerpipe Algorithm for Boilerplate Removal in News Articles Using HTML Tree Structure]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Viveros-Jiménez]]></surname>
<given-names><![CDATA[Francisco]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Sanchez-Perez]]></surname>
<given-names><![CDATA[Miguel A.]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Gómez-Adorno]]></surname>
<given-names><![CDATA[Helena]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Posadas-Durán]]></surname>
<given-names><![CDATA[Juan-Pablo]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Sidorov]]></surname>
<given-names><![CDATA[Grigori]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Gelbukh]]></surname>
<given-names><![CDATA[Alexander]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
</contrib-group>
<aff id="Af1">
<institution><![CDATA[Instituto Politécnico Nacional, Centro de Investigación en Computación]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
<country>Mexico</country>
</aff>
<aff id="Af2">
<institution><![CDATA[Instituto Politécnico Nacional, Escuela Superior de Ingeniería Mecánica y Eléctrica]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
<country>Mexico</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>06</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>06</month>
<year>2018</year>
</pub-date>
<volume>22</volume>
<numero>2</numero>
<fpage>483</fpage>
<lpage>489</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_arttext&amp;pid=S1405-55462018000200483&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_abstract&amp;pid=S1405-55462018000200483&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_pdf&amp;pid=S1405-55462018000200483&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[Abstract: It is well known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded in a randomized way every time a page is requested, so the HTML document differs on each load and hashing of the content is not possible. Therefore, the non-relevant text of documents needs to be filtered out. Automatic extraction of relevant text from on-line documents (news articles, etc.) is not a trivial task. Many algorithms for this purpose are described in the literature. Boilerpipe is one of the most popular, and its performance is among the best. In this paper, we present a method that improves the precision of the Boilerpipe algorithm by using the HTML tree to select the relevant content. Our filter greatly increases precision (by at least 15%) at the cost of some recall, resulting in an overall F1-measure improvement (around 5%). We run the experiments on news articles using our own corpus of 2,400 news articles in Spanish and 1,000 in English.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[Boilerplate removal]]></kwd>
<kwd lng="en"><![CDATA[news extraction]]></kwd>
<kwd lng="en"><![CDATA[HTML tree structure]]></kwd>
<kwd lng="en"><![CDATA[Boilerpipe]]></kwd>
</kwd-group>
</article-meta>
</front><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bar-Yossef]]></surname>
<given-names><![CDATA[Z.]]></given-names>
</name>
<name>
<surname><![CDATA[Rajagopalan]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<source><![CDATA[Template detection via data mining and its applications]]></source>
<year>2002</year>
<conf-name><![CDATA[Eleventh International Conference on World Wide Web, WWW '02]]></conf-name>
<conf-loc> </conf-loc>
<page-range>580-91</page-range></nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Baroni]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Chantree]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
<name>
<surname><![CDATA[Kilgarriff]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Sharoff]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<source><![CDATA[Cleaneval: a competition for cleaning web pages]]></source>
<year>2008</year>
<conf-name><![CDATA[Sixth International Conference on Language Resources and Evaluation, LREC '08]]></conf-name>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Cai]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Yu]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Wen]]></surname>
<given-names><![CDATA[J.-R.]]></given-names>
</name>
<name>
<surname><![CDATA[Ma]]></surname>
<given-names><![CDATA[W.-Y.]]></given-names>
</name>
</person-group>
<source><![CDATA[Extracting content structure for web pages based on visual representation]]></source>
<year>2003</year>
<conf-name><![CDATA[Fifth Asia-Pacific Web Conference on Web Technologies and Applications, APWeb '03]]></conf-name>
<conf-loc> </conf-loc>
<page-range>406-17</page-range></nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Chakrabarti]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Kumar]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Punera]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
</person-group>
<source><![CDATA[Page-level template detection via isotonic smoothing]]></source>
<year>2007</year>
<conf-name><![CDATA[Sixteenth International Conference on World Wide Web, WWW '07]]></conf-name>
<conf-loc> </conf-loc>
<page-range>61-70</page-range></nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Christen]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[A survey of indexing techniques for scalable record linkage and deduplication]]></article-title>
<source><![CDATA[IEEE transactions on knowledge and data engineering]]></source>
<year>2012</year>
<volume>24</volume>
<numero>9</numero>
<issue>9</issue>
<page-range>1537-55</page-range></nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Churches]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Christen]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Lim]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
<name>
<surname><![CDATA[Zhu]]></surname>
<given-names><![CDATA[J. X.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Preparation of name and address data for record linkage using hidden Markov models]]></article-title>
<source><![CDATA[BMC Medical Informatics and Decision Making]]></source>
<year>2002</year>
<volume>2</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>9</page-range></nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Clark]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Practical introduction to record linkage for injury research]]></article-title>
<source><![CDATA[Injury Prevention]]></source>
<year>2004</year>
<volume>10</volume>
<numero>3</numero>
<issue>3</issue>
<page-range>186-91</page-range></nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Debnath]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Mitra]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Pal]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[Giles]]></surname>
<given-names><![CDATA[C. L.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Automatic identification of informative sections of web pages]]></article-title>
<source><![CDATA[IEEE transactions on knowledge and data engineering]]></source>
<year>2005</year>
<volume>17</volume>
<numero>9</numero>
<issue>9</issue>
<page-range>1233-46</page-range></nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Endrédy]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
<name>
<surname><![CDATA[Novák]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[More effective Boilerplate removal - the GoldMiner algorithm]]></article-title>
<source><![CDATA[Polibits]]></source>
<year>2013</year>
<volume>48</volume>
<page-range>79-83</page-range></nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Evert]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<source><![CDATA[A lightweight and efficient tool for cleaning web pages]]></source>
<year>2008</year>
<conf-name><![CDATA[Sixth International Conference on Language Resources and Evaluation, LREC '08]]></conf-name>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ferrara]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
<name>
<surname><![CDATA[Meo]]></surname>
<given-names><![CDATA[P. D.]]></given-names>
</name>
<name>
<surname><![CDATA[Fiumara]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Baumgartner]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Web data extraction, applications and techniques: A survey]]></article-title>
<source><![CDATA[Knowledge-Based Systems]]></source>
<year>2014</year>
<volume>70</volume>
<page-range>301-23</page-range></nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Gao]]></surname>
<given-names><![CDATA[W.]]></given-names>
</name>
<name>
<surname><![CDATA[Abou-Assaleh]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
</person-group>
<source><![CDATA[GenieKnows web page cleaning system]]></source>
<year>2007</year>
<conf-name><![CDATA[Building and Exploring Web Corpora: Proceedings of the Third Web as Corpus Workshop, incorporating Cleaneval]]></conf-name>
<conf-loc> </conf-loc>
<page-range>135</page-range></nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Gibson]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Punera]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
<name>
<surname><![CDATA[Tomkins]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<source><![CDATA[The volume and evolution of web page templates]]></source>
<year>2005</year>
<conf-name><![CDATA[Special Interest Tracks and Posters of the Fourteenth International Conference on World Wide Web, WWW '05]]></conf-name>
<conf-loc> </conf-loc>
<page-range>830-9</page-range></nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Gibson]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Wellner]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Lubar]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<source><![CDATA[Adaptive web-page content identification]]></source>
<year>2007</year>
<conf-name><![CDATA[Ninth Annual ACM International Workshop on Web Information and Data Management, WIDM '07]]></conf-name>
<conf-loc> </conf-loc>
<page-range>105-12</page-range></nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Girardi]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
</person-group>
<source><![CDATA[Htmlcleaner: Extracting the relevant text from the web pages]]></source>
<year>2007</year>
<conf-name><![CDATA[Third Web as Corpus Workshop, WAC '07]]></conf-name>
<conf-loc> </conf-loc>
<page-range>15-6</page-range></nlm-citation>
</ref>
<ref id="B16">
<label>16</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Hofmann]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
<name>
<surname><![CDATA[Weerkamp]]></surname>
<given-names><![CDATA[W.]]></given-names>
</name>
</person-group>
<source><![CDATA[Web Corpus Cleaning using Content and Structure]]></source>
<year>2007</year>
<conf-name><![CDATA[Building and Exploring Web Corpora: Proceedings of the Third Web as Corpus Workshop, WAC '07]]></conf-name>
<conf-loc> </conf-loc>
<page-range>145-54</page-range></nlm-citation>
</ref>
<ref id="B17">
<label>17</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kilgarriff]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Rychly]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Smrz]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Tugwell]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<source><![CDATA[The sketch engine]]></source>
<year>2004</year>
<conf-name><![CDATA[EURALEX]]></conf-name>
<conf-loc> </conf-loc>
<page-range>105-116</page-range></nlm-citation>
</ref>
<ref id="B18">
<label>18</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kohlschütter]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
</person-group>
<source><![CDATA[A densitometric analysis of web template content]]></source>
<year>2009</year>
<conf-name><![CDATA[Eighteenth International Conference on World Wide Web, WWW '09]]></conf-name>
<conf-loc> </conf-loc>
<page-range>1165-6</page-range></nlm-citation>
</ref>
<ref id="B19">
<label>19</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kohlschütter]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Fankhauser]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Nejdl]]></surname>
<given-names><![CDATA[W.]]></given-names>
</name>
</person-group>
<source><![CDATA[Boilerplate detection using shallow text features]]></source>
<year>2010</year>
<conf-name><![CDATA[Third ACM International Conference on Web Search and Data Mining, WSDM '10]]></conf-name>
<conf-loc> </conf-loc>
<page-range>441-50</page-range></nlm-citation>
</ref>
<ref id="B20">
<label>20</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Marek]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Pecina]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Spousta]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<source><![CDATA[Web page cleaning with conditional random fields]]></source>
<year>2007</year>
<conf-name><![CDATA[Building and Exploring Web Corpora: Proceedings of the Third Web as Corpus Workshop, incorporating Cleaneval]]></conf-name>
<conf-loc> </conf-loc>
<page-range>155</page-range></nlm-citation>
</ref>
<ref id="B21">
<label>21</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Pasternack]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Roth]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<source><![CDATA[Extracting article text from the web with maximum subsequence segmentation]]></source>
<year>2009</year>
<conf-name><![CDATA[Eighteenth International Conference on World Wide Web, WWW '09]]></conf-name>
<conf-loc> </conf-loc>
<page-range>971-80</page-range></nlm-citation>
</ref>
<ref id="B22">
<label>22</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Pomikálek]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<source><![CDATA[Removing Boilerplate and Duplicate Content from Web Corpora]]></source>
<year>2011</year>
<publisher-name><![CDATA[Masaryk University, Faculty of Informatics]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B23">
<label>23</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Rahm]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
<name>
<surname><![CDATA[Do]]></surname>
<given-names><![CDATA[H. H.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Data cleaning: Problems and current approaches]]></article-title>
<source><![CDATA[IEEE Data Eng. Bull.]]></source>
<year>2000</year>
<volume>23</volume>
<numero>4</numero>
<issue>4</issue>
<page-range>3-13</page-range></nlm-citation>
</ref>
<ref id="B24">
<label>24</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Schäfer]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Bildhauer]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
</person-group>
<source><![CDATA[Building large corpora from the web using a new efficient tool chain]]></source>
<year>2012</year>
<conf-name><![CDATA[Eighth International Conference on Language Resources and Evaluation, LREC '12]]></conf-name>
<conf-loc> </conf-loc>
<page-range>486-93</page-range></nlm-citation>
</ref>
<ref id="B25">
<label>25</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Yi]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Liu]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Li]]></surname>
<given-names><![CDATA[X.]]></given-names>
</name>
</person-group>
<source><![CDATA[Eliminating noisy information in web pages for data mining]]></source>
<year>2003</year>
<conf-name><![CDATA[Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03]]></conf-name>
<conf-loc> </conf-loc>
<page-range>296-305</page-range></nlm-citation>
</ref>
</ref-list>
</back>
</article>
