<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>1405-5546</journal-id>
<journal-title><![CDATA[Computación y Sistemas]]></journal-title>
<abbrev-journal-title><![CDATA[Comp. y Sist.]]></abbrev-journal-title>
<issn>1405-5546</issn>
<publisher>
<publisher-name><![CDATA[Instituto Politécnico Nacional, Centro de Investigación en Computación]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S1405-55462022000100183</article-id>
<article-id pub-id-type="doi">10.13053/cys-26-1-4163</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Pseudo-Labeling Improves News Identification and Categorization with Few Annotated Data]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Jiménez]]></surname>
<given-names><![CDATA[Diana]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Gambino]]></surname>
<given-names><![CDATA[Omar J.]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Calvo]]></surname>
<given-names><![CDATA[Hiram]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
</contrib-group>
<aff id="Af1">
<institution><![CDATA[Instituto Politécnico Nacional, Escuela Superior de Cómputo]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
<country>Mexico</country>
</aff>
<aff id="Af2">
<institution><![CDATA[Instituto Politécnico Nacional, Centro de Investigación en Computación]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
<country>Mexico</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>03</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>03</month>
<year>2022</year>
</pub-date>
<volume>26</volume>
<numero>1</numero>
<fpage>183</fpage>
<lpage>193</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_arttext&amp;pid=S1405-55462022000100183&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_abstract&amp;pid=S1405-55462022000100183&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_pdf&amp;pid=S1405-55462022000100183&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[News article analysis has been the subject of numerous research papers in recent years. Tasks such as identifying fake news and classifying news into categories have been addressed, but all of them require news as the main source of data. Websites that offer news articles also include other kinds of content, such as advertising and personal opinions, which should be excluded when collecting data to create a news corpus. In this paper we propose a method that identifies news and separates it from other documents (non-news), following a semi-supervised approach that uses NER features corresponding to the who, where, and when questions, along with a measure of subjectivity. We experimented with different pseudo-labeling methods to improve the classifier's performance and obtained a robust increase of 2% to 3% when adding automatically labeled data on top of manually tagged data, even for small quantities of manually tagged data (20%). We also explored the use of this semi-supervised method for classifying news by category (news categorization), obtaining better performance than supervised approaches.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[News identification]]></kwd>
<kwd lng="en"><![CDATA[semi-supervised classification]]></kwd>
<kwd lng="en"><![CDATA[news categorization]]></kwd>
</kwd-group>
</article-meta>
</front><back>
<ref-list>
<ref id="B1">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bracewell]]></surname>
<given-names><![CDATA[D. B.]]></given-names>
</name>
<name>
<surname><![CDATA[Yan]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Ren]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
<name>
<surname><![CDATA[Kuroiwa]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Category classification and topic discovery of Japanese and English news articles]]></article-title>
<source><![CDATA[Electronic Notes in Theoretical Computer Science]]></source>
<year>2009</year>
<volume>225</volume>
<page-range>51-65</page-range></nlm-citation>
</ref>
<ref id="B2">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Fernández]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[García]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Galar]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Prati]]></surname>
<given-names><![CDATA[R. C.]]></given-names>
</name>
<name>
<surname><![CDATA[Krawczyk]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Herrera]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
</person-group>
<source><![CDATA[Learning from imbalanced data sets]]></source>
<year>2018</year>
<volume>10</volume>
<publisher-name><![CDATA[Springer]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B3">
<nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[García-Mendoza]]></surname>
<given-names><![CDATA[C.-V.]]></given-names>
</name>
<name>
<surname><![CDATA[Gambino]]></surname>
<given-names><![CDATA[O. J.]]></given-names>
</name>
</person-group>
<source><![CDATA[News article classification of Mexican newspapers]]></source>
<year>2018</year>
<conf-name><![CDATA[International Congress of Telematics and Computing]]></conf-name>
<conf-loc> </conf-loc>
<page-range>101-9</page-range></nlm-citation>
</ref>
<ref id="B4">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Gunther]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Beck]]></surname>
<given-names><![CDATA[P. A.]]></given-names>
</name>
<name>
<surname><![CDATA[Nisbet]]></surname>
<given-names><![CDATA[E. C.]]></given-names>
</name>
</person-group>
<source><![CDATA[Fake news did have a significant impact on the vote in the 2016 election: Original full-length version with methodological appendix]]></source>
<year>2018</year>
<publisher-loc><![CDATA[Columbus, OH ]]></publisher-loc>
<publisher-name><![CDATA[Ohio State University]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B5">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Martin]]></surname>
<given-names><![CDATA[S. E.]]></given-names>
</name>
<name>
<surname><![CDATA[Copeland]]></surname>
<given-names><![CDATA[D. A.]]></given-names>
</name>
</person-group>
<source><![CDATA[Function of Newspapers in Society: A Global Perspective]]></source>
<year>2003</year>
<publisher-name><![CDATA[Praeger]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B6">
<nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Maududie]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Retnani]]></surname>
<given-names><![CDATA[W. E. Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Rohim]]></surname>
<given-names><![CDATA[M. A.]]></given-names>
</name>
</person-group>
<source><![CDATA[An approach of web scraping on news website based on regular expression]]></source>
<year>2018</year>
<conf-name><![CDATA[2nd East Indonesia Conference on Computer and Information Technology (EIConCIT)]]></conf-name>
<conf-date>2018</conf-date>
<conf-loc> </conf-loc>
<page-range>203-7</page-range></nlm-citation>
</ref>
<ref id="B7">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Mitchell]]></surname>
<given-names><![CDATA[T. M.]]></given-names>
</name>
</person-group>
<source><![CDATA[Machine learning]]></source>
<year>1997</year>
<publisher-loc><![CDATA[New York ]]></publisher-loc>
<publisher-name><![CDATA[McGraw-Hill]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B8">
<nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Nurfikri]]></surname>
<given-names><![CDATA[F. S.]]></given-names>
</name>
<name>
<surname><![CDATA[Mubarok]]></surname>
<given-names><![CDATA[M. S.]]></given-names>
</name>
</person-group>
<collab>Adiwijaya</collab>
<source><![CDATA[News topic classification using mutual information and Bayesian network]]></source>
<year>2018</year>
<conf-name><![CDATA[6th International Conference on Information and Communication Technology (ICoICT)]]></conf-name>
<conf-date>2018</conf-date>
<conf-loc> </conf-loc>
<page-range>162-6</page-range></nlm-citation>
</ref>
<ref id="B9">
<nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Phuvipadawat]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Murata]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
</person-group>
<source><![CDATA[Breaking news detection and tracking in Twitter]]></source>
<year>2010</year>
<volume>3</volume>
<conf-name><![CDATA[IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology]]></conf-name>
<conf-date>2010</conf-date>
<conf-loc> </conf-loc>
<page-range>120-3</page-range><publisher-name><![CDATA[IEEE]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B10">
<nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sankaranarayanan]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Samet]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Teitler]]></surname>
<given-names><![CDATA[B. E.]]></given-names>
</name>
<name>
<surname><![CDATA[Lieberman]]></surname>
<given-names><![CDATA[M. D.]]></given-names>
</name>
<name>
<surname><![CDATA[Sperling]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<source><![CDATA[TwitterStand: News in tweets]]></source>
<year>2009</year>
<conf-name><![CDATA[17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems]]></conf-name>
<conf-loc> </conf-loc>
<page-range>42-51</page-range></nlm-citation>
</ref>
<ref id="B11">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sarkar]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<source><![CDATA[Text Analytics with Python]]></source>
<year>2016</year>
<publisher-name><![CDATA[Springer]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B12">
<nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Schmitt]]></surname>
<given-names><![CDATA[X.]]></given-names>
</name>
<name>
<surname><![CDATA[Kubler]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Robert]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Papadakis]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Le Traon]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
</person-group>
<source><![CDATA[A replicable comparison study of NER software: StanfordNLP, NLTK, OpenNLP, SpaCy, Gate]]></source>
<year>2019</year>
<conf-name><![CDATA[Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS)]]></conf-name>
<conf-date>2019</conf-date>
<conf-loc> </conf-loc>
<page-range>338-43</page-range></nlm-citation>
</ref>
<ref id="B13">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Stallone]]></surname>
<given-names><![CDATA[V.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Do not trust me: How news readers perceive and recognize native advertising]]></article-title>
<source><![CDATA[IADIS International Journal on WWW/Internet]]></source>
<year>2020</year>
<volume>18</volume>
<numero>1</numero>
<issue>1</issue>
</nlm-citation>
</ref>
<ref id="B14">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Tudisco]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
<name>
<surname><![CDATA[Benson]]></surname>
<given-names><![CDATA[A. R.]]></given-names>
</name>
<name>
<surname><![CDATA[Prokopchik]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
</person-group>
<source><![CDATA[Nonlinear higher-order label spreading]]></source>
<year>2020</year>
</nlm-citation>
</ref>
<ref id="B15">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Wright]]></surname>
<given-names><![CDATA[R. E.]]></given-names>
</name>
</person-group>
<source><![CDATA[Logistic regression]]></source>
<year>1995</year>
<publisher-name><![CDATA[American Psychological Association]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B16">
<nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Yarowsky]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<source><![CDATA[Unsupervised word sense disambiguation rivaling supervised methods]]></source>
<year>1995</year>
<conf-name><![CDATA[33rd Annual Meeting of the Association for Computational Linguistics]]></conf-name>
<conf-loc> </conf-loc>
<page-range>189-96</page-range></nlm-citation>
</ref>
<ref id="B17">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Zhang]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Jin]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Zhou]]></surname>
<given-names><![CDATA[Z.-H.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Understanding bag-of-words model: a statistical framework]]></article-title>
<source><![CDATA[International Journal of Machine Learning and Cybernetics]]></source>
<year>2010</year>
<volume>1</volume>
<numero>1-4</numero>
<issue>1-4</issue>
<page-range>43-52</page-range></nlm-citation>
</ref>
<ref id="B18">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Zhao]]></surname>
<given-names><![CDATA[W.]]></given-names>
</name>
<name>
<surname><![CDATA[Zhang]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Yuan]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Liu]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Shan]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Zhang]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[The study on the text classification for financial news based on partial information]]></article-title>
<source><![CDATA[IEEE Access]]></source>
<year>2020</year>
<volume>8</volume>
<page-range>100426-37</page-range></nlm-citation>
</ref>
<ref id="B19">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Zhu]]></surname>
<given-names><![CDATA[X.]]></given-names>
</name>
<name>
<surname><![CDATA[Goldberg]]></surname>
<given-names><![CDATA[A. B.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Introduction to semi-supervised learning]]></article-title>
<source><![CDATA[Synthesis Lectures on Artificial Intelligence and Machine Learning]]></source>
<year>2009</year>
<volume>3</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>1-130</page-range></nlm-citation>
</ref>
</ref-list>
</back>
</article>
