<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>1405-5546</journal-id>
<journal-title><![CDATA[Computación y Sistemas]]></journal-title>
<abbrev-journal-title><![CDATA[Comp. y Sist.]]></abbrev-journal-title>
<issn>1405-5546</issn>
<publisher>
<publisher-name><![CDATA[Instituto Politécnico Nacional, Centro de Investigación en Computación]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S1405-55462017000400725</article-id>
<article-id pub-id-type="doi">10.13053/cys-21-4-2697</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Pre-Processing of English-Hindi Corpus for Statistical Machine Translation]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Arora]]></surname>
<given-names><![CDATA[Karunesh Kumar]]></given-names>
</name>
<xref ref-type="aff" rid="Af1"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Agrawal]]></surname>
<given-names><![CDATA[Shyam S.]]></given-names>
</name>
<xref ref-type="aff" rid="Af2"/>
</contrib>
</contrib-group>
<aff id="Af1">
<institution><![CDATA[Centre for Development of Advanced Computing]]></institution>
<addr-line><![CDATA[Noida]]></addr-line>
<country>India</country>
</aff>
<aff id="Af2">
<institution><![CDATA[KIIT Group of Institutions]]></institution>
<addr-line><![CDATA[Bhondsi Gurugram]]></addr-line>
<country>India</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>12</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>12</month>
<year>2017</year>
</pub-date>
<volume>21</volume>
<numero>4</numero>
<fpage>725</fpage>
<lpage>737</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_arttext&amp;pid=S1405-55462017000400725&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_abstract&amp;pid=S1405-55462017000400725&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_pdf&amp;pid=S1405-55462017000400725&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[A corpus can be regarded as the fuel for data-driven approaches to machine translation. Building a parallel corpus is labour-intensive, which makes it a costly and scarce resource. The full potential of the available data therefore needs to be exploited, and this can be ensured by removing the different types of inconsistencies encountered throughout the NLP domain. This paper describes experiments on corpus text pre-processing for building a baseline Statistical Machine Translation (SMT) system. The pre-processing is divided into two stages: (i) the first stage handles the orthographic representation of content, and (ii) the second stage handles non-lexical words. The first stage covers punctuation symbols, casing, word spellings and their normalization, while the second stage covers the handling of numbers and named entities (NEs), applied on top of the best settings observed in the first stage. The motivation behind these experiments was to derive a relationship and gauge the extent to which the corpus should be pre-processed, thereby building a considerably optimized baseline SMT system. This baseline system will provide a platform for further experiments with different syntactic and semantic factors. The findings presented here are for the English-Hindi language pair; however, the concept of pre-processing is language-neutral and can be transferred to any other language pair.
For English-to-Hindi translation, the best performance is obtained by retaining the punctuation symbols, lower-casing the English corpus and spell-normalizing the Hindi corpus. In the second stage of experiments, the handling of numbers and named entities is described, wherein these are mapped to unique class labels. The impact of these experiments is explained, along with their appropriateness for the language pair concerned.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[Statistical machine translation]]></kwd>
<kwd lng="en"><![CDATA[preprocessing]]></kwd>
<kwd lng="en"><![CDATA[normalization]]></kwd>
<kwd lng="en"><![CDATA[named entity handling]]></kwd>
</kwd-group>
</article-meta>
</front><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sproat]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Black]]></surname>
<given-names><![CDATA[A.W.]]></given-names>
</name>
<name>
<surname><![CDATA[Chen]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Kumar]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Ostendorf]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Richards]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Normalization of non-standard words]]></article-title>
<source><![CDATA[Computer Speech and Language]]></source>
<year>2001</year>
<volume>15</volume>
<numero>3</numero>
<issue>3</issue>
<page-range>287&#8211;333</page-range></nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Caseli]]></surname>
<given-names><![CDATA[H.M.]]></given-names>
</name>
<name>
<surname><![CDATA[Nunes]]></surname>
<given-names><![CDATA[I.A.]]></given-names>
</name>
</person-group>
<source><![CDATA[Statistical Machine Translation: little changes big impacts]]></source>
<year>2009</year>
<conf-name><![CDATA[Proceedings of the 7th Brazilian Symposium in Information and Human Language Technology]]></conf-name>
<conf-loc> </conf-loc>
<page-range>1&#8211;9</page-range></nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bojar]]></surname>
<given-names><![CDATA[O.]]></given-names>
</name>
<name>
<surname>Stra&#328;ák</surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Zeman]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<source><![CDATA[Data Issues in English-to-Hindi Machine Translation]]></source>
<year>2010</year>
<conf-name><![CDATA[Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC'10)]]></conf-name>
<conf-loc> </conf-loc>
<page-range>1771&#8211;1777</page-range></nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Pal]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Naskar]]></surname>
<given-names><![CDATA[S.K.]]></given-names>
</name>
<name>
<surname><![CDATA[Pecina]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Bandyopadhyay]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Way]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<source><![CDATA[Handling Named Entities and Compound Verbs in Phrase-Based Statistical Machine Translation]]></source>
<year>2010</year>
<conf-name><![CDATA[Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (ACL'10)]]></conf-name>
<conf-loc> </conf-loc>
<page-range>46&#8211;54</page-range></nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Lane]]></surname>
<given-names><![CDATA[I.R.]]></given-names>
</name>
<name>
<surname><![CDATA[Waibel]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<source><![CDATA[Class-based statistical machine translation for field maintainable speech-to-speech translation]]></source>
<year>2008</year>
<conf-name><![CDATA[Proceedings of the International Conference on Speech Communications and Technology]]></conf-name>
<conf-loc> </conf-loc>
<page-range>2362&#8211;2365</page-range></nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Markov]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
<name>
<surname><![CDATA[Gómez-Adorno]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Sidorov]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Gelbukh]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Adapting Cross-Genre Author Profiling to Language and Corpus]]></article-title>
<source><![CDATA[Working Notes Papers of the CLEF (CLEF'16)]]></source>
<year>2016</year>
<volume>1609</volume>
<page-range>947&#8211;955</page-range></nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Markov]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
<name>
<surname><![CDATA[Stamatatos]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
<name>
<surname><![CDATA[Sidorov]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
</person-group>
<source><![CDATA[Improving Cross-Topic Authorship Attribution: The Role of Pre-Processing]]></source>
<year>2017</year>
<conf-name><![CDATA[Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2017)]]></conf-name>
<conf-loc>Budapest, Hungary</conf-loc>
</nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sellami]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Deffaf]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
<name>
<surname><![CDATA[Sadat]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
<name>
<surname><![CDATA[Hadrich-Belguith]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Improved Statistical Machine Translation by Cross-Linguistic Projection of Named Entities Recognition and Translation]]></article-title>
<source><![CDATA[Computación y Sistemas]]></source>
<year>2015</year>
<volume>19</volume>
<numero>4</numero>
<issue>4</issue>
<page-range>701&#8211;711</page-range></nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Okuma]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Yamamoto]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Sumita]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Introducing translation dictionary into phrase-based SMT]]></article-title>
<source><![CDATA[IEICE transactions on information and systems]]></source>
<year>2008</year>
<volume>E91-D</volume>
<numero>7</numero>
<issue>7</issue>
<page-range>2051&#8211;2057</page-range></nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Brown]]></surname>
<given-names><![CDATA[P.F.]]></given-names>
</name>
<name>
<surname><![CDATA[Della Pietra]]></surname>
<given-names><![CDATA[V.J.]]></given-names>
</name>
<name>
<surname><![CDATA[Della Pietra]]></surname>
<given-names><![CDATA[S.A.]]></given-names>
</name>
<name>
<surname><![CDATA[Mercer]]></surname>
<given-names><![CDATA[R.L.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[The Mathematics of Statistical Machine Translation: Parameter Estimation]]></article-title>
<source><![CDATA[Computational Linguistics]]></source>
<year>1993</year>
<volume>19</volume>
<page-range>263&#8211;311</page-range></nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Och]]></surname>
<given-names><![CDATA[F.J.]]></given-names>
</name>
<name>
<surname><![CDATA[Ney]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[The Alignment template approach to statistical machine translation]]></article-title>
<source><![CDATA[Computational Linguistics]]></source>
<year>2004</year>
<volume>30</volume>
<numero>4</numero>
<issue>4</issue>
<page-range>417&#8211;449</page-range></nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Och]]></surname>
<given-names><![CDATA[F.J.]]></given-names>
</name>
<name>
<surname><![CDATA[Ney]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[A Systematic Comparison of Various Statistical Alignment Models]]></article-title>
<source><![CDATA[Computational Linguistics]]></source>
<year>2003</year>
<volume>29</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>19&#8211;51</page-range></nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Heafield]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
</person-group>
<source><![CDATA[KenLM: faster and smaller language model queries]]></source>
<year>2011</year>
<conf-name><![CDATA[Proceedings of the Sixth Workshop on Statistical Machine Translation (EMNLP'11)]]></conf-name>
<conf-loc> </conf-loc>
<page-range>187&#8211;197</page-range></nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Koehn]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Hoang]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Birch]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Callison-Burch]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Federico]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Bertoldi]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[Cowan]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Shen]]></surname>
<given-names><![CDATA[W.]]></given-names>
</name>
<name>
<surname><![CDATA[Moran]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Zens]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Dyer]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Bojar]]></surname>
<given-names><![CDATA[O.]]></given-names>
</name>
<name>
<surname><![CDATA[Constantin]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Herbst]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
</person-group>
<source><![CDATA[Moses: Open Source Toolkit for Statistical Machine Translation]]></source>
<year>2007</year>
<conf-name><![CDATA[Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics]]></conf-name>
<conf-loc> </conf-loc>
<page-range>177&#8211;180</page-range></nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Papineni]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
<name>
<surname><![CDATA[Roukos]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Ward]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Zhu]]></surname>
<given-names><![CDATA[W.J.]]></given-names>
</name>
</person-group>
<source><![CDATA[BLEU: a method for automatic evaluation of machine translation]]></source>
<year>2002</year>
<conf-name><![CDATA[Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics]]></conf-name>
<conf-loc> </conf-loc>
<page-range>311&#8211;318</page-range></nlm-citation>
</ref>
<ref id="B16">
<label>16</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Och]]></surname>
<given-names><![CDATA[F.J.]]></given-names>
</name>
</person-group>
<source><![CDATA[Minimum Error Rate Training in Statistical Machine Translation]]></source>
<year>2003</year>
<conf-name><![CDATA[Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics]]></conf-name>
<conf-loc> </conf-loc>
<page-range>160&#8211;167</page-range></nlm-citation>
</ref>
</ref-list>
</back>
</article>
