<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>1405-5546</journal-id>
<journal-title><![CDATA[Computación y Sistemas]]></journal-title>
<abbrev-journal-title><![CDATA[Comp. y Sist.]]></abbrev-journal-title>
<issn>1405-5546</issn>
<publisher>
<publisher-name><![CDATA[Instituto Politécnico Nacional, Centro de Investigación en Computación]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S1405-55462019000300883</article-id>
<article-id pub-id-type="doi">10.13053/cys-23-3-3271</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Multi-Head Multi-Layer Attention to Deep Language Representations for Grammatical Error Detection]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Kaneko]]></surname>
<given-names><![CDATA[Masahiro]]></given-names>
</name>
<xref ref-type="aff" rid="Af1"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Komachi]]></surname>
<given-names><![CDATA[Mamoru]]></given-names>
</name>
<xref ref-type="aff" rid="Af1"/>
</contrib>
</contrib-group>
<aff id="Af1">
<institution><![CDATA[Tokyo Metropolitan University, Graduate School of Systems Design]]></institution>
<addr-line><![CDATA[Tokyo]]></addr-line>
<country>Japan</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>09</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>09</month>
<year>2019</year>
</pub-date>
<volume>23</volume>
<numero>3</numero>
<fpage>883</fpage>
<lpage>891</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_arttext&amp;pid=S1405-55462019000300883&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_abstract&amp;pid=S1405-55462019000300883&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_pdf&amp;pid=S1405-55462019000300883&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[It is known that a deep neural network model pre-trained with large-scale data greatly improves accuracy on various tasks, especially when resources are constrained. However, the information needed to solve a given task can vary, and simply using the output of the final layer is not necessarily sufficient. Moreover, to our knowledge, exploiting large language representation models to detect grammatical errors has not yet been studied. In this work, we investigate the effect of utilizing information not only from the final layer but also from intermediate layers of a pre-trained language representation model to detect grammatical errors. We propose a multi-head multi-layer attention model that determines the appropriate layers in Bidirectional Encoder Representations from Transformers (BERT). The proposed method achieved the best scores on three datasets for grammatical error detection tasks, outperforming the current state-of-the-art method by 6.0 points on FCE, 8.2 points on CoNLL14, and 12.2 points on JFLEG in terms of F0.5. We also demonstrate that by using multi-head multi-layer attention, our model can exploit a broader range of information for each token in a sentence than a model that uses only the final layer's information.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[Multi-head multi-layer attention]]></kwd>
<kwd lng="en"><![CDATA[grammatical error detection]]></kwd>
</kwd-group>
</article-meta>
</front><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Al-Rfou]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Choe]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Constant]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[Guo]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Jones]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Character-level language modeling with deeper self-attention]]></article-title>
<source><![CDATA[AAAI]]></source>
<year>2019</year>
<publisher-name><![CDATA[Association for the Advancement of Artificial Intelligence]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Radford]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Narasimhan]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
<name>
<surname><![CDATA[Salimans]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Sutskever]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
</person-group>
<source><![CDATA[Improving Language Understanding with Unsupervised Learning]]></source>
<year>2018</year>
<publisher-name><![CDATA[OpenAI]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Belinkov]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Durrani]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[Dalvi]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
<name>
<surname><![CDATA[Sajjad]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Glass]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[What do neural machine translation models learn about morphology?]]></article-title>
<source><![CDATA[ACL]]></source>
<year>2017</year>
<page-range>861-72</page-range><publisher-name><![CDATA[Association for Computational Linguistics]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Devlin]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Chang]]></surname>
<given-names><![CDATA[M.-W.]]></given-names>
</name>
<name>
<surname><![CDATA[Lee]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
<name>
<surname><![CDATA[Toutanova]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
</person-group>
<source><![CDATA[BERT: Pre-training of deep bidirectional transformers for language understanding]]></source>
<year>2018</year>
</nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kaneko]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Sakaizawa]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Komachi]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Grammatical error detection using error- and grammaticality-specific word embeddings]]></article-title>
<source><![CDATA[IJCNLP]]></source>
<year>2017</year>
<page-range>40-8</page-range><publisher-name><![CDATA[Asian Federation of Natural Language Processing]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kasewa]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Stenetorp]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Riedel]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Wronging a right: Generating better errors to improve grammatical error detection]]></article-title>
<source><![CDATA[EMNLP]]></source>
<year>2018</year>
<page-range>4977-83</page-range><publisher-name><![CDATA[Association for Computational Linguistics]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kingma]]></surname>
<given-names><![CDATA[D. P.]]></given-names>
</name>
<name>
<surname><![CDATA[Ba]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<source><![CDATA[Adam: A method for stochastic optimization]]></source>
<year>2015</year>
<conf-name><![CDATA[ICLR]]></conf-name>
<conf-loc> </conf-loc>
</nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Nagata]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Nakatani]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Evaluating performance of grammatical error detection to maximize learning effect]]></article-title>
<source><![CDATA[COLING]]></source>
<year>2010</year>
<page-range>894-900</page-range><publisher-name><![CDATA[Coling 2010 Organizing Committee]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Napoles]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Sakaguchi]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
<name>
<surname><![CDATA[Tetreault]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[JFLEG: A fluency corpus and benchmark for grammatical error correction]]></article-title>
<source><![CDATA[EACL]]></source>
<year>2017</year>
<page-range>229-34</page-range><publisher-name><![CDATA[Association for Computational Linguistics]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ng]]></surname>
<given-names><![CDATA[H. T.]]></given-names>
</name>
<name>
<surname><![CDATA[Wu]]></surname>
<given-names><![CDATA[S. M.]]></given-names>
</name>
<name>
<surname><![CDATA[Briscoe]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Hadiwinoto]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Susanto]]></surname>
<given-names><![CDATA[R. H.]]></given-names>
</name>
<name>
<surname><![CDATA[Bryant]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[The CoNLL-2014 shared task on grammatical error correction]]></article-title>
<source><![CDATA[CoNLL: Shared Task]]></source>
<year>2014</year>
<page-range>1-14</page-range><publisher-name><![CDATA[Association for Computational Linguistics]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Peters]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Neumann]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Iyyer]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Gardner]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Clark]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Lee]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
<name>
<surname><![CDATA[Zettlemoyer]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Deep contextualized word representations]]></article-title>
<source><![CDATA[NAACL]]></source>
<year>2018</year>
<page-range>2227-37</page-range><publisher-name><![CDATA[Association for Computational Linguistics]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Peters]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Neumann]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Zettlemoyer]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Yih]]></surname>
<given-names><![CDATA[W.-t.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Dissecting contextual word embeddings: Architecture and representation]]></article-title>
<source><![CDATA[EMNLP]]></source>
<year>2018</year>
<page-range>1499-509</page-range><publisher-name><![CDATA[Association for Computational Linguistics]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Rei]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Semi-supervised multitask learning for sequence labeling]]></article-title>
<source><![CDATA[ACL]]></source>
<year>2017</year>
<page-range>2121-30</page-range><publisher-name><![CDATA[Association for Computational Linguistics]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Rei]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Felice]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Yuan]]></surname>
<given-names><![CDATA[Z.]]></given-names>
</name>
<name>
<surname><![CDATA[Briscoe]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Artificial error generation with machine translation and syntactic patterns]]></article-title>
<source><![CDATA[BEA]]></source>
<year>2017</year>
<page-range>287-92</page-range><publisher-name><![CDATA[Association for Computational Linguistics]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Rei]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Søgaard]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Jointly learning to label sentences and tokens]]></article-title>
<source><![CDATA[AAAI]]></source>
<year>2019</year>
<publisher-name><![CDATA[Association for the Advancement of Artificial Intelligence]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B16">
<label>16</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Rei]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Yannakoudakis]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Compositional sequence labeling models for error detection in learner writing]]></article-title>
<source><![CDATA[ACL]]></source>
<year>2016</year>
<page-range>1181-91</page-range><publisher-name><![CDATA[Association for Computational Linguistics]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B17">
<label>17</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Srivastava]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[Hinton]]></surname>
<given-names><![CDATA[G. E.]]></given-names>
</name>
<name>
<surname><![CDATA[Krizhevsky]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Sutskever]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
<name>
<surname><![CDATA[Salakhutdinov]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Dropout: A simple way to prevent neural networks from overfitting]]></article-title>
<source><![CDATA[Journal of Machine Learning Research]]></source>
<year>2014</year>
<volume>15</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>1929-58</page-range></nlm-citation>
</ref>
<ref id="B18">
<label>18</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Takase]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Suzuki]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Nagata]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Direct output connection for a high-rank language model]]></article-title>
<source><![CDATA[EMNLP]]></source>
<year>2018</year>
<page-range>4599-609</page-range><publisher-name><![CDATA[Association for Computational Linguistics]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B19">
<label>19</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Taylor]]></surname>
<given-names><![CDATA[W. L.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Cloze procedure: A new tool for measuring readability]]></article-title>
<source><![CDATA[Journalism Quarterly]]></source>
<year>1953</year>
<volume>30</volume>
<numero>4</numero>
<issue>4</issue>
<page-range>415-33</page-range></nlm-citation>
</ref>
<ref id="B20">
<label>20</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Vaswani]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Shazeer]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[Parmar]]></surname>
<given-names><![CDATA[N.]]></given-names>
</name>
<name>
<surname><![CDATA[Uszkoreit]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Jones]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Gomez]]></surname>
<given-names><![CDATA[A. N.]]></given-names>
</name>
<name>
<surname><![CDATA[Kaiser]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Polosukhin]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Attention is all you need]]></article-title>
<source><![CDATA[NIPS]]></source>
<year>2017</year>
<page-range>5998-6008</page-range><publisher-name><![CDATA[Curran Associates, Inc.]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B21">
<label>21</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Wu]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Schuster]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Chen]]></surname>
<given-names><![CDATA[Z.]]></given-names>
</name>
<name>
<surname><![CDATA[Le]]></surname>
<given-names><![CDATA[Q. V.]]></given-names>
</name>
<name>
<surname><![CDATA[Norouzi]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Macherey]]></surname>
<given-names><![CDATA[W.]]></given-names>
</name>
<name>
<surname><![CDATA[Krikun]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Cao]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Gao]]></surname>
<given-names><![CDATA[Q.]]></given-names>
</name>
<name>
<surname><![CDATA[Macherey]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
</person-group>
<source><![CDATA[Google's neural machine translation system: Bridging the gap between human and machine translation]]></source>
<year>2016</year>
</nlm-citation>
</ref>
<ref id="B22">
<label>22</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Yannakoudakis]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Briscoe]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Medlock]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[A new dataset and method for automatically grading ESOL texts]]></article-title>
<source><![CDATA[ACL: Human Language Technologies]]></source>
<year>2011</year>
<page-range>180-9</page-range><publisher-name><![CDATA[Association for Computational Linguistics]]></publisher-name>
</nlm-citation>
</ref>
</ref-list>
</back>
</article>
