<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>1405-5546</journal-id>
<journal-title><![CDATA[Computación y Sistemas]]></journal-title>
<abbrev-journal-title><![CDATA[Comp. y Sist.]]></abbrev-journal-title>
<issn>1405-5546</issn>
<publisher>
<publisher-name><![CDATA[Instituto Politécnico Nacional, Centro de Investigación en Computación]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S1405-55462018000401241</article-id>
<article-id pub-id-type="doi">10.13053/cys-22-4-3061</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Gender Prediction in English-Hindi Code-Mixed Social Media Content: Corpus and Baseline System]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Khandelwal]]></surname>
<given-names><![CDATA[Ankush]]></given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Swami]]></surname>
<given-names><![CDATA[Sahil]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Akhtar]]></surname>
<given-names><![CDATA[Syed Sarfaraz]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Shrivastava]]></surname>
<given-names><![CDATA[Manish]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
</contrib>
</contrib-group>
<aff id="Af1">
<institution><![CDATA[International Institute of Information Technology, Language Technologies Research Centre]]></institution>
<addr-line><![CDATA[Hyderabad]]></addr-line>
<country>India</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>12</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>12</month>
<year>2018</year>
</pub-date>
<volume>22</volume>
<numero>4</numero>
<fpage>1241</fpage>
<lpage>1247</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_arttext&amp;pid=S1405-55462018000401241&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_abstract&amp;pid=S1405-55462018000401241&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_pdf&amp;pid=S1405-55462018000401241&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[The rapid growth of social networking sites has produced a huge amount of unprocessed user-generated data that can be used for text mining. Author profiling, the problem of automatically determining attributes such as an author's gender and age group from a text, is gaining popularity in computational linguistics. Most past research in author profiling has concentrated on English texts [1, 2]. However, many users switch languages while posting on social media, a practice called code-mixing, which poses challenges for text classification and author profiling such as spelling variation, non-grammatical structure and transliteration [3]. Very few annotated English-Hindi code-mixed datasets of social media content are available online [4]. In this paper, we analyze the task of predicting an author's gender in code-mixed content and present a corpus of English-Hindi texts collected from Twitter and annotated with the author's gender. We also explore language identification for every word in this corpus. We present a supervised classification baseline system that uses various machine learning algorithms to identify the gender of an author from a text, based on character- and word-level features.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[Author profiling]]></kwd>
<kwd lng="en"><![CDATA[code-mixing]]></kwd>
<kwd lng="en"><![CDATA[language detection]]></kwd>
<kwd lng="en"><![CDATA[linguistics]]></kwd>
<kwd lng="en"><![CDATA[SVM]]></kwd>
<kwd lng="en"><![CDATA[random forest]]></kwd>
</kwd-group>
</article-meta>
</front><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Estival]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<source><![CDATA[Author profiling for English emails]]></source>
<year>2007</year>
<conf-name><![CDATA[10th Conference of the Pacific Association for Computational Linguistics]]></conf-name>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Peersman]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Daelemans]]></surname>
<given-names><![CDATA[W.]]></given-names>
</name>
<name>
<surname><![CDATA[Van-Vaerenbergh]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
</person-group>
<source><![CDATA[Predicting age and gender in online social networks]]></source>
<year>2011</year>
<conf-name><![CDATA[3rd International Workshop on Search and Mining User-Generated Contents (ACM'11)]]></conf-name>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Barman]]></surname>
<given-names><![CDATA[U.]]></given-names>
</name>
</person-group>
<source><![CDATA[Code mixing: A challenge for language identification in the language of social media]]></source>
<year>2014</year>
<conf-name><![CDATA[The First Workshop on Computational Approaches to Code Switching]]></conf-name>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Vyas]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[POS Tagging of English-Hindi Code-Mixed Social Media Content]]></article-title>
<source><![CDATA[EMNLP]]></source>
<year>2014</year>
<volume>14</volume>
</nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Marquardt]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<source><![CDATA[Age and gender identification in social media]]></source>
<year>2014</year>
<conf-name><![CDATA[(CLEF'14), Evaluation Labs]]></conf-name>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Myers-Scotton]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
</person-group>
<source><![CDATA[Duelling languages: Grammatical structure in codeswitching]]></source>
<year>1997</year>
<publisher-name><![CDATA[Oxford University Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Gumperz]]></surname>
<given-names><![CDATA[J.J.]]></given-names>
</name>
</person-group>
<source><![CDATA[Discourse strategies]]></source>
<year>1982</year>
<volume>1</volume>
<publisher-name><![CDATA[Cambridge University Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Danet]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Herring]]></surname>
<given-names><![CDATA[S.C.]]></given-names>
</name>
</person-group>
<source><![CDATA[The multilingual Internet: Language, culture, and communication online]]></source>
<year>2007</year>
<publisher-name><![CDATA[Oxford University Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Cárdenas-Claros]]></surname>
<given-names><![CDATA[M. S.]]></given-names>
</name>
<name>
<surname><![CDATA[Isharyanti]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Code-switching and code-mixing in internet chatting: Between 'yes,' 'ya,' and 'si' - a case study]]></article-title>
<source><![CDATA[The JALT CALL Journal]]></source>
<year>2009</year>
<page-range>67-78</page-range></nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Auer]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Muhamedova]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA['Embedded language' and 'matrix language' in insertional language mixing: Some problematic cases]]></article-title>
<source><![CDATA[Rivista di linguistica]]></source>
<year>2005</year>
<page-range>35-54</page-range></nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Billal]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Fonseca]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Sadat]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
</person-group>
<source><![CDATA[Named Entity Recognition and Hashtag De-composition to Improve the Classification of Tweets]]></source>
<year>2016</year>
<conf-name><![CDATA[2nd Workshop on Noisy User-generated Text (WNUT'16)]]></conf-name>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Khandelwal]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Swami]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Akhtar]]></surname>
<given-names><![CDATA[S.S.]]></given-names>
</name>
<name>
<surname><![CDATA[Shrivastava]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<source><![CDATA[Classification Of Spanish Election Tweets (COSET) 2017: Classifying Tweets using Character and Word Level Features]]></source>
<year>2017</year>
<conf-name><![CDATA[Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval'17), CEUR Workshop Proceedings]]></conf-name>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Lodhi]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Text classification using string kernels]]></article-title>
<source><![CDATA[Journal of Machine Learning Research]]></source>
<year>2002</year>
<page-range>419-44</page-range></nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Cavnar]]></surname>
<given-names><![CDATA[W.B.]]></given-names>
</name>
<name>
<surname><![CDATA[Trenkle]]></surname>
<given-names><![CDATA[J.M.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[N-gram-based text categorization]]></article-title>
<source><![CDATA[Ann Arbor MI 48113.2]]></source>
<year>1994</year>
<page-range>161-75</page-range></nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Argamon]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<source><![CDATA[Gender, genre, and writing style in formal written texts]]></source>
<year>2003</year>
<page-range>321-46</page-range><publisher-name><![CDATA[Text-The Hague Then Amsterdam Then Berlin]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B16">
<label>16</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Joachims]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Transductive inference for text classification using support vector machines]]></article-title>
<source><![CDATA[ICML]]></source>
<year>1999</year>
<volume>99</volume>
</nlm-citation>
</ref>
<ref id="B17">
<label>17</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[McCord]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Chuah]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<source><![CDATA[Spam detection on twitter using traditional classifiers]]></source>
<year>2011</year>
<conf-name><![CDATA[International Conference on Autonomic and Trusted Computing]]></conf-name>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B18">
<label>18</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Pedregosa]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Scikit-learn: Machine learning in Python]]></article-title>
<source><![CDATA[Journal of Machine Learning Research]]></source>
<year>2011</year>
<page-range>2825-30</page-range></nlm-citation>
</ref>
<ref id="B19">
<label>19</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Lusa]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Challenges in projecting clustering results across gene expression-profiling datasets]]></article-title>
<source><![CDATA[JNCI: Journal of the National Cancer Institute]]></source>
<year>2007</year>
<page-range>1715-23</page-range></nlm-citation>
</ref>
<ref id="B20">
<label>20</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Du]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<source><![CDATA[Supervised classification using balanced training]]></source>
<year>2014</year>
<conf-name><![CDATA[International Conference on Statistical Language and Speech Processing]]></conf-name>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
</ref-list>
</back>
</article>
