<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>1405-5546</journal-id>
<journal-title><![CDATA[Computación y Sistemas]]></journal-title>
<abbrev-journal-title><![CDATA[Comp. y Sist.]]></abbrev-journal-title>
<issn>1405-5546</issn>
<publisher>
<publisher-name><![CDATA[Instituto Politécnico Nacional, Centro de Investigación en Computación]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S1405-55462018000300729</article-id>
<article-id pub-id-type="doi">10.13053/cys-22-3-3034</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Unsupervised Creation of Normalization Dictionaries for Micro-Blogs in Arabic, French and English]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Htait]]></surname>
<given-names><![CDATA[Amal]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
<xref ref-type="aff" rid="Aaf"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Fournier]]></surname>
<given-names><![CDATA[Sébastien]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
<xref ref-type="aff" rid="Aaf"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Bellot]]></surname>
<given-names><![CDATA[Patrice]]></given-names>
</name>
<xref ref-type="aff" rid="Aff"/>
<xref ref-type="aff" rid="Aaf"/>
</contrib>
</contrib-group>
<aff id="Af1">
<institution><![CDATA[Université de Toulon]]></institution>
<addr-line><![CDATA[Marseille ]]></addr-line>
<country>France</country>
</aff>
<aff id="Af2">
<institution><![CDATA[Avignon Université]]></institution>
<addr-line><![CDATA[Marseille ]]></addr-line>
<country>France</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>09</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>09</month>
<year>2018</year>
</pub-date>
<volume>22</volume>
<numero>3</numero>
<fpage>729</fpage>
<lpage>737</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_arttext&amp;pid=S1405-55462018000300729&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_abstract&amp;pid=S1405-55462018000300729&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_pdf&amp;pid=S1405-55462018000300729&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[Abstract: Text normalization is needed to correct micro-blog messages and make better sense of them for information retrieval purposes. Unfortunately, text normalization tools and resources are rarely shared. This paper presents an unsupervised approach to text normalization that uses distributed representations of words, also known as "word embeddings", applied to Arabic, French and English. In addition, a tool is supplied to create dictionaries for micro-blog normalization, in the form of pairs of a misspelled word and its standard-form word, for Arabic, French and English. The tool will be available as open source, together with the following resources: word embedding models (with a vocabulary size of 9 million words for the Arabic model, 5 million words for the English model and 683 thousand words for the French model) and three normalization dictionaries of 10 thousand pairs in Arabic, 3 thousand pairs in French and 18 thousand pairs in English. The evaluation of the tool shows an average normalization success of 96% for English, 89.5% for Arabic and 85% for French. In addition, using the English normalization dictionary with a sentiment analysis tool for micro-blog messages increases the F-measure from 58.15 to 59.56.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[Normalization]]></kwd>
<kwd lng="en"><![CDATA[dictionaries]]></kwd>
<kwd lng="en"><![CDATA[word embedding]]></kwd>
<kwd lng="en"><![CDATA[micro-blogs]]></kwd>
<kwd lng="en"><![CDATA[unsupervised]]></kwd>
<kwd lng="en"><![CDATA[multilingual]]></kwd>
<kwd lng="en"><![CDATA[Arabic]]></kwd>
<kwd lng="en"><![CDATA[French]]></kwd>
</kwd-group>
</article-meta>
</front><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sridhar]]></surname>
<given-names><![CDATA[V. K. R.]]></given-names>
</name>
</person-group>
<source><![CDATA[Unsupervised text normalization using distributed representations of words and phrases]]></source>
<year>2015</year>
<conf-name><![CDATA[ 1st Workshop on Vector Space Modeling for Natural Language Processing]]></conf-name>
<conf-loc> </conf-loc>
<page-range>8-16</page-range></nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Shannon]]></surname>
<given-names><![CDATA[C. E.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[A Mathematical Theory of Communication]]></article-title>
<source><![CDATA[The Bell System Technical Journal]]></source>
<year>1948</year>
<volume>27</volume>
<numero>3</numero>
<issue>3</issue>
<page-range>379-423</page-range></nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Brill]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
<name>
<surname><![CDATA[Moore]]></surname>
<given-names><![CDATA[R. C.]]></given-names>
</name>
</person-group>
<source><![CDATA[An improved error model for noisy channel spelling correction]]></source>
<year>2000</year>
<conf-name><![CDATA[ 38th Annual Meeting on Association for Computational Linguistics]]></conf-name>
<conf-loc> </conf-loc>
<page-range>286-93</page-range></nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Cook]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Stevenson]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<source><![CDATA[An unsupervised model for text message normalization]]></source>
<year>2009</year>
<conf-name><![CDATA[ Workshop on Computational Approaches to Linguistic Creativity]]></conf-name>
<conf-loc> </conf-loc>
<page-range>71-8</page-range></nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Aw]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Zhang]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Xiao]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<source><![CDATA[A phrase-based statistical model for SMS text normalization]]></source>
<year>2006</year>
<conf-name><![CDATA[ COLING/ACL on Main conference poster sessions]]></conf-name>
<conf-loc> </conf-loc>
<page-range>33-40</page-range></nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kobus]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Yvon]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
<name>
<surname><![CDATA[Damnati]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
</person-group>
<source><![CDATA[Transcrire les SMS comme on reconnaît la parole]]></source>
<year>2008</year>
<conf-name><![CDATA[ Actes de la Conférence sur le Traitement Automatique des Langues (TALN&#8217;08)]]></conf-name>
<conf-loc> </conf-loc>
<page-range>128-38</page-range></nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Han]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Baldwin]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
</person-group>
<source><![CDATA[Lexical normalisation of short text messages: Makn sens a #twitter]]></source>
<year>2011</year>
<conf-name><![CDATA[ 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1]]></conf-name>
<conf-loc> </conf-loc>
<page-range>368-78</page-range></nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bertaglia]]></surname>
<given-names><![CDATA[T. F. C.]]></given-names>
</name>
<name>
<surname><![CDATA[Nunes]]></surname>
<given-names><![CDATA[M. d. G. V.]]></given-names>
</name>
</person-group>
<source><![CDATA[Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization]]></source>
<year>2016</year>
<conf-name><![CDATA[ 26th Int&#8217;l Conf. Computational Linguistics (COLING 16)]]></conf-name>
<conf-loc> </conf-loc>
<page-range>112-120</page-range></nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Eryi&#287;it]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Toruno&#287;lu-Selamet]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Social media text normalization for Turkish]]></article-title>
<source><![CDATA[Natural Language Engineering]]></source>
<year>2017</year>
<page-range>1-41</page-range></nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Yan]]></surname>
<given-names><![CDATA[X.]]></given-names>
</name>
<name>
<surname><![CDATA[Li]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Fan]]></surname>
<given-names><![CDATA[W.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Identifying domain relevant user generated content through noise reduction: a test in a Chinese stock discussion forum]]></article-title>
<source><![CDATA[Information Discovery and Delivery]]></source>
<year>2017</year>
<volume>45</volume>
<numero>4</numero>
<issue>4</issue>
<page-range>181-93</page-range></nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Mikolov]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Chen]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
<name>
<surname><![CDATA[Corrado]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
</person-group>
<source><![CDATA[Efficient estimation of word representations in vector space]]></source>
<year>2013</year>
<publisher-name><![CDATA[ICLR Workshop Papers]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Salameh]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Mohammad]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Kiritchenko]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<source><![CDATA[Sentiment after Translation: A Case-Study on Arabic Social Media Posts]]></source>
<year>2015</year>
<conf-name><![CDATA[ Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies]]></conf-name>
<conf-loc> </conf-loc>
<page-range>767-77</page-range></nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Htait]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Fournier]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Bellot]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
</person-group>
<source><![CDATA[LSIS at SemEval Task 4: Using Adapted Sentiment Similarity Seed Words For English and Arabic Tweet Polarity Classification]]></source>
<year>2017</year>
<conf-name><![CDATA[ 11th International Workshop on Semantic Evaluation (SemEval-2017)]]></conf-name>
<conf-loc> </conf-loc>
<page-range>718-22</page-range></nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Rosenthal]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Nakov]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Ritter]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Stoyanov]]></surname>
<given-names><![CDATA[V.]]></given-names>
</name>
</person-group>
<source><![CDATA[SemEval-2014 Task 9: Sentiment analysis in Twitter]]></source>
<year>2014</year>
<conf-name><![CDATA[ 8th International Workshop on Semantic Evaluation (SemEval 14)]]></conf-name>
<conf-loc> </conf-loc>
<page-range>73-80</page-range></nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Powers]]></surname>
<given-names><![CDATA[D. M.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation]]></article-title>
<source><![CDATA[Journal of Machine Learning Technologies]]></source>
<year>2011</year>
<volume>2</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>37-63</page-range></nlm-citation>
</ref>
</ref-list>
</back>
</article>
