Computación y Sistemas

Online ISSN 2007-9737, Print ISSN 1405-5546

Comp. y Sist. vol. 23 no. 3, Ciudad de México, Jul./Sep. 2019, Epub Aug 09, 2021

https://doi.org/10.13053/cys-23-3-3265 

Articles of the Thematic Issue

Promoting the Knowledge of Source Syntax in Transformer NMT Is Not Needed

Thuong-Hai Pham1 

Dominik Macháček1 

Ondřej Bojar1  * 

1 Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Prague, Czech Republic. pham@ufal.mff.cuni.cz, machacek@ufal.mff.cuni.cz, bojar@ufal.mff.cuni.cz.


Abstract

The utility of linguistic annotation in neural machine translation seemed to have been established in past papers. Those experiments were, however, limited to recurrent sequence-to-sequence architectures and relatively small data settings. We focus on the state-of-the-art Transformer model and use considerably larger corpora. Specifically, we try to promote the knowledge of source-side syntax using multi-task learning, either through simple data manipulation techniques or through a dedicated model component. In particular, we train one of the Transformer attention heads to produce the source-side dependency tree. Overall, our results cast some doubt on the utility of multi-task setups with linguistic information. The data manipulation techniques, recommended in previous works, prove ineffective in large data settings. The treatment of self-attention as dependencies seems much more promising: it helps in translation and reveals that the Transformer model can very easily grasp the syntactic structure. An important but curious result is, however, that identical gains are obtained by using trivial "linear trees" instead of true dependencies. The gain may thus come not from the added linguistic knowledge but from some simpler regularizing effect we induced on the self-attention matrices.

Keywords: Syntax; Transformer NMT; Multi-Task NMT

1 Introduction

Neural machine translation (NMT) has dominated the field of MT, and many works document that the quality of NMT can, under some circumstances, be further improved by incorporating linguistic information from the source and/or target side. Experiments so far were, however, limited to recurrent sequence-to-sequence architectures [6,2].

The recent WMT evaluations [5,3] and [27] show that the novel Transformer architecture [40] has set a new benchmark, and it is thus interesting to see whether providing this architecture with linguistic information is equally helpful, or whether the Transformer already models these phenomena without supervision.

We experiment with German-to-Czech and Czech-to-English translation and focus on source-side dependency annotation using multi-task techniques. We try two ways of forcing the model to consider source syntax: (1) by linearizing the syntactic tree and mixing the translation and parsing training examples, and (2) by adding a secondary objective to interpret one of the attention heads as the syntactic tree.

In Section 2, we survey recent experiments with incorporating linguistic information into NMT, focusing particularly on works which use multi-task learning strategies and on works which consider the syntactic analysis of the sentence. A brief description of the data and common settings of our experiments is provided in Section 3. In Section 4, we explore the simple technique of multi-task by alternating training examples of the individual tasks, discussing also the "cost" of multi-tasking in terms of training steps. Section 5 presents the other approach, interpreting the self-attention matrix in the Transformer architecture as the dependency tree of the source sentence. Here we also add the contrastive experiment with dummy diagonal parses. Section 6 discusses the observations and we conclude in Section 7.

2 Related Work

The idea of multi-task training is to benefit from inherent and implicit similarities between two or more machine learning tasks. If the tasks are solved by a joint model with some of its parameters shared among the tasks, the model should exploit the commonalities and perform better in one or more of the tasks. This improvement can come from various sources, including the additional (often different) training data used for the additional tasks or some form of regularization or generalization that the other tasks promote.

In machine translation, multi-tasking has brought interesting results in multi-lingual MT systems and also in the use of additional linguistic annotation [20,44,10,12]. Similarly, [35] incorporate linguistic annotation into the semantic role labeling task.

[9] combined translation and dependency parsing by sharing the translation encoder hidden states with the buffer hidden states in a shift-reduce parsing model [8]. Aiming at the same goal, [1] proposed a very simple method. Instead of modifying the model structure, they represented the target sentence as a linearized lexicalized constituency tree. Subsequently, a sequence-to-sequence (seq2seq) model [36] was used to translate the source sentence to this linearized tree, i.e. indeed performing the two tasks: producing the string of the target sentence jointly with its syntactic analysis. [19] use the same trick for (target-side) dependency trees, proposing a tree-traversal algorithm to linearize the dependency tree. Unfortunately, their algorithm was limited to projective trees.

In parallel to our work, [13] examined various scheduling strategies for a very simple approach to multi-tasking: all the tasks are converted to a common format of source and target sequences of symbols from a joint vocabulary, and one sequence-to-sequence system is trained on the mix of training examples from the different tasks. The scheduling strategy specifies the proportion of the tasks in training batches over time. [13] report improvements in BLEU score [26] in a small-data setting for German-to-English translation for all multi-task setups (translation combined with POS tagging and/or source-side1 parsing). In the "standard" data size and the opposite translation direction, results are mixed, and only one of the scheduling strategies, and only with the POS secondary task, improves MT over the baseline.

The papers mentioned so far targeted primarily the quality of MT (as measured by BLEU), not the secondary tasks. [13] note that their system performs reasonably well in both tagging and parsing. [31] present an in-depth analysis of the syntactic knowledge learned by the recurrent sequence-to-sequence NMT. [39] are the first to use Transformer and observe that the recurrence is indeed important to model hierarchical structures.

[22] benefit from CCG tags [33] added to NMT on the source side in the form of word factors and on the target side by interleaving the CCG tags and target words. The additional information proves useful when the CCG tags and words are processed in sync. [37] report similar success in interleaving words and morphological tags.

3 Data and Common Settings

Experiments in this paper are based on two language pairs: German-to-Czech (de2cs) and Czech-to-English (cs2en).

German-to-Czech (de2cs) translation is trained on Europarl [16] and OpenSubtitles2016 [38]. These are the only publicly available parallel data for this language pair. We carry out some necessary cleanup preprocessing, character normalization and tokenization.

Czech-to-English (cs2en) translation is trained on a subset of CzEng 1.7 [4].2 The data sizes used for MT training are summarized in Table 1.

Table 1 Data used in our experiments. Test and dev data for de2cs originate in WMT newstests. For cs2en, we use only a small portion of CzEng, as indicated by the section numbers (#..) 

Dataset                   de2cs            cs2en
Train sent. pairs         8.8M             #00-#08: 5.2M
Train tokens (src/tgt)    89M/78M          61M/69M
Test sent. pairs          news 2013: 3k    #09: 10k
Dev sent. pairs           news 2011: 3k    #09: 1k

For training of parsing tasks, we used the same datasets automatically annotated on source sides. For German source we used UDPipe [34], with the model trained on Universal Dependencies 2.0 (UD, [24]). For Czech source we used the annotation provided in CzEng release, originally created by Treex [29]. This annotation is based on Prague Dependency Treebank (PDT, [11]). For parsing evaluation, we used gold test set from UD and PDT, respectively.

We use several automatic evaluation metrics to assess translation quality: BLEU [26], CharacTER [41], BEER [32], and chrF3 [30]. For the experiments in Section 4, the BLEU score is cased and computed within T2T;3 in Section 5, we use sacrebleu.4 For the dependency parsing task, we use the unlabeled attachment score (UAS).

To assess the significance of improvements over a given baseline, we use MT-ComparEval [14], which implements the paired bootstrap resampling test (confidence level 0.05 or 0.01; [15]).
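
For concreteness, the sketch below (plain Python, not the MT-ComparEval code actually used) illustrates the two evaluation procedures just mentioned: the paired bootstrap resampling test, written against a generic corpus-level metric callable, and the unlabeled attachment score. The name corpus_metric is a placeholder for any metric such as BLEU.

    import random

    def paired_bootstrap(sys_a, sys_b, refs, corpus_metric, n_samples=1000, seed=1):
        """Paired bootstrap resampling [15]: resample the test set with
        replacement and count how often system A scores above system B."""
        rng = random.Random(seed)
        n = len(refs)
        wins_a = 0
        for _ in range(n_samples):
            idx = [rng.randrange(n) for _ in range(n)]
            score_a = corpus_metric([sys_a[i] for i in idx], [refs[i] for i in idx])
            score_b = corpus_metric([sys_b[i] for i in idx], [refs[i] for i in idx])
            if score_a > score_b:
                wins_a += 1
        # One-sided p-value: the share of resampled test sets on which A did not win.
        return 1.0 - wins_a / n_samples

    def uas(predicted_heads, gold_heads):
        """Unlabeled attachment score: fraction of tokens whose predicted head
        index matches the gold head index."""
        correct = sum(p == g for ps, gs in zip(predicted_heads, gold_heads)
                      for p, g in zip(ps, gs))
        total = sum(len(gs) for gs in gold_heads)
        return correct / total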

4 Simple Alternating Multi-Task

For simple alternating multi-task learning, the input and output of each task are represented as sequences of tokens and the training examples (pairs of sequences) are alternated in the training data. The basic architecture for MT can thus be used without any modifications.

4.1 Approach

The main idea of simple alternating multi-tasking is to represent both tasks in a formally identical way. To indicate which of the tasks should be performed on a given input sequence, we add a special token as the very last symbol of the sentence.

The encoder and decoder of the NMT model are thus shared across all tasks, which allows the encoder to learn the source language better; on the other hand, a certain part of the model capacity is occupied by task alternation and by the separate language models the decoder must maintain for each task.

In our experiments, we mix two tasks: MT and one additional linguistic task (see Figure 1) or a dummy referential task (see Figure 2). In the "DepHeads" task, the word forms of the nodes' parents in the dependency tree are predicted; an unlabeled dependency tree can be reconstructed in a post-processing step5. The "DepLabels" task is tagging with dependency labels, and "DepHeads+DepLabels" is an interleaved combination of the two; a construction sketch is given after Figure 2.

Fig. 1 Sample dependency tree, inputs and expected outputs of linguistic secondary tasks 

Fig. 2 Sample inputs and expected outputs of dummy secondary tasks 
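
The figures themselves are not reproduced here, but the construction of the training examples is easy to describe in code. The following minimal sketch is our own illustration: the task-ID tokens and the function name are made up for this example, not taken from the paper's actual pipeline.

    def make_linguistic_examples(tokens, heads, deprels, target_tokens):
        """Build (source, target) sequence pairs for the MT task and the
        linguistic secondary tasks.  heads[i] is the 0-based index of token i's
        parent (-1 for the root), deprels[i] is its dependency label."""
        examples = []
        # MT task: the source sentence, with a task-ID token as the last symbol.
        examples.append((tokens + ["<to_mt>"], target_tokens))
        # DepHeads: predict the word form of each token's parent in the tree.
        parent_forms = [tokens[h] if h >= 0 else "<root>" for h in heads]
        examples.append((tokens + ["<to_depheads>"], parent_forms))
        # DepLabels: tag every token with its dependency relation label.
        examples.append((tokens + ["<to_deplabels>"], list(deprels)))
        # DepHeads+DepLabels: interleave parent word forms and labels.
        interleaved = [x for pair in zip(parent_forms, deprels) for x in pair]
        examples.append((tokens + ["<to_depheads_deplabels>"], interleaved))
        return examples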

The training data in multi-tasking are selected by a constant scheduler as in [13], with parameter 0.5, which means the trainer alternates between the tasks, on average, after every training step. As [13] remind, this is different from [20] and [43], where the mixing happens at the level of batches and not individual examples.
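
An example-level constant scheduler can be sketched as follows (an illustration only; with the parameter at 0.5 it reproduces the alternation described above, and the generator simply cycles over the two pools of examples):

    import itertools
    import random

    def constant_scheduler(mt_examples, secondary_examples, p_secondary=0.5, seed=1):
        """Yield an endless stream of training examples, drawing each one from
        the secondary task with probability p_secondary and from MT otherwise."""
        rng = random.Random(seed)
        mt_pool = itertools.cycle(mt_examples)
        sec_pool = itertools.cycle(secondary_examples)
        while True:
            yield next(sec_pool) if rng.random() < p_secondary else next(mt_pool)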

The experiments here in Section 4 used Google's Tensor2Tensor version 1.2.9, transformer_big_single_gpu hyperparameter set (hidden size 1024, filter size 4096, 16 self-attention heads, 6 layers) with batch size 1500, 60k warmup steps and 100k shared vocabulary provided by T2T's default SubwordTextEncoder. We note that in some cases, the particular variant of subword units and token pre-processing can have a tremendous effect on the final NMT performance [21]. Adding such a study is however beyond the scope of this work.

4.2 Training Cost of the Multi-Task

Adding training examples of the secondary task is bound to affect the training throughput and speed.6 The hope is that this extra training cost is worth the gains obtained in the main task. We examine it empirically by comparing the training speed of the baseline run (no multi-task) and several versions of "dummy" multi-task setups as illustrated in Figure 2.

In "CountSrcWords", the system is expected to count the source words and emit the result as one token holding the decimal number. "EnumSrcWords" is similar but the expected output is much easier for the architecture to grasp: the count should be expressed by an appropriate number of copies of the same special output token. In "CopySrc", the system should simply learn to copy the source, which should be very easy for an attentive architecture.

The task identification is clearly marked on input with a special token. To measure its impact on MT quality, we provide an experimental run "MT TaskID", where only one MT task with task identification token is provided.

Figure 3 summarizes the resulting learning curves on the development set. As we supposed, the dummy secondary tasks were easy to learn, but they hurt MT performance. Enumerating and counting full words are very similar in difficulty and the model learned them in almost the same time (accuracy of 80% reached in about 200k training steps), but enumerating worsens MT quality much more (BLEU scores around 12 instead of beyond 15), probably because it occupies a bigger part of the decoder. A surprising result is that the task identification token on the baseline MT data decreases overall MT performance in the long run, see the curve "MT TaskID" in Figure 3.

Fig. 3 Learning curves of the de2cs baseline and dummy secondary tasks over training steps. MT BLEU left, percentage of correct answers for the secondary task right 

4.3 Results of Simple Alternating Multi-Task

As Table 2 and Figure 4 (left) indicate, none of the simple alternating multi-task setups with a linguistic secondary task outperformed the MT baseline on either of our language pairs after the same amount of time. (In our conditions, training steps and training time are easily convertible; 600k training steps correspond to approximately 40 hours.)

Table 2 Automatic scores for MT with multi-task by simple alternation. All experiments are after 600k of training steps. Scores (space-delimited within each cell): BLEU, CharacTER, BEER, and chrF3. Best in bold, second-best slanted. Statistical significance marked as † (p < 0.05) and ‡ (p < 0.01) when compared to the second-best 

Model                     de2cs dev                   de2cs test                  cs2en dev                  cs2en test
MT Baseline               17.90‡ 60.73 52.30 47.06    19.74‡ 58.60 53.08 48.62    44.92 42.11 63.32 65.34    44.20 41.68 62.70 63.88
MT+DepLabels              16.52 62.67 50.86 45.18     17.87 59.65 51.67 47.01     41.98 43.28 61.80 63.63    41.94 42.54 61.63 61.96
MT+DepHeads               16.36 62.55 50.76 45.21     17.51 62.15 51.29 46.52     40.72 43.75 61.85 62.78    41.10 42.30 61.41 61.68
MT+DepHeads+DepLabels     13.62 70.25 48.52 43.06     15.45 67.14 49.69 44.79     39.57 45.63 60.50 61.30    40.25 43.63 60.97 61.05

Fig. 4 Learning MT BLEU curves of the de2cs baseline and linguistic secondary tasks over training steps (left) and over MT epochs (right). 

However, if we measure the performance against the MT training data throughput (the amount of data consumed in training, [28]), we see that the multi-tasking runs achieved the same level as the baseline with less training data. We conclude that the cost of sharing the encoder and decoder between two tasks is higher than the benefit from the additional linguistic resources, but in particularly small data settings, multi-tasking may be desirable.

Table 3 shows the comparison between linguistic and dummy secondary tasks. "DepHeads" and "DepLabels" outperformed "CountSrcWords" and all other dummy tasks. This leads us to the conclusion that syntactic information is useful for the model, but the cost of the secondary task cancels all the gains. The still slightly worse result for "MT+DepHeads+DepLabels" confirms this: two additional tasks are more expensive than one.

Table 3 Comparison of BLEU scores at 600k training steps for linguistic and dummy secondary tasks with simple alternating approach 

Model dev test
MT Baseline 17.90 19.74
MT TaskID 16.53 18.20
MT+DepLabels 16.52 17.87
MT+DepHeads 16.36 17.62
MT+CountSrcWords 15.70 17.51
MT+CopySrc 14.73 16.07
MT+DepHeads+DepLabels 13.62 15.45
MT+EnumSrcWords 12.16 14.04

Table 4 shows the performance in parsing in terms of unlabeled attachment score and/or label accuracy, depending on the available output of the secondary task(s). As the referential parser for German, we use UDPipe, the same parser that supervised our model. It should be noted that we used the supervision by UDPipe in a non-standard way: our system (and the referential parser) take raw word tokens on input, while UDPipe is designed to segment multi-word tokens, such as the German zum, into syntactic words, as zu dem, each of which is a single node in the tree. For Czech, we report the score of the winner of the CoNLL 2007 Shared Task [25], the latest available evaluation on the same data; we expect that the state of the art is higher nowadays. Possible limitations of our model are the shared decoder and the potentially inaccurate automatic annotation of the training data.

Table 4 Test set scores for parsing source language (German and Czech, resp.) by simple alternation. "label acc" is the accuracy of tagging words with their dependency labels. Best in bold, second-best slanted 

Model                     de2cs                  cs2en
                          UAS     label acc      UAS     label acc
referential parser        62.87   73.62          86.28   83.38
MT+DepLabels              -       75.40          -       85.01
MT+DepHeads               62.15   -              80.35   -
MT+DepHeads+DepLabels     54.98   68.44          80.01   83.99

Our systems reach a very similar UAS in de2cs (62.87 vs. 62.15) and a somewhat worse one in cs2en (86.28 vs. 80.35), and they even surpass the referential parser in label accuracy. The three-task system "MT+DepHeads+DepLabels" generally performs worse, not only in MT as discussed above but also in the secondary tasks.

5 Promoting Dependency Interpretation of Self-Attention

In this section, we propose a different but similarly simple technique to promote explicit knowledge of source syntax in the model. Our inspiration comes from the neural model for dependency parsing by [7]. That model produces a matrix S(u, v) expressing the probability that the word u is the head of v. The construction of this matrix is very similar to the matrix of self-attention weights α in the Transformer model. From this similarity, we speculate that the self-attentive architecture of Transformer NMT has the capacity to learn dependency parsing, and we only need to lightly promote the particular linguistic dependencies captured in a treebank.

5.1 Model Architecture

Figure 5 illustrates our joint model ("DepParse"). The translation part is kept unchanged. The only difference is that we reinterpret one of the self-attention heads in the Transformer encoder as if it were the dependency matrix S(u, v). The training objective combines (1) translation quality, in terms of the cross-entropy of the candidate translation against the reference, and (2) the unlabeled attachment score (UAS) of the proposed heads against the (automatic) gold parse. The particular choice of the Transformer head which serves as the dependency parser is arbitrary. Put differently, we constrain the Transformer model to use one of its heads to follow the given syntactic structure of the sentence.

Fig. 5 Joint dependency parsing and translation model ("DepParse") 
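
The combined objective can be sketched as follows. This is our own illustration, not the paper's Tensor2Tensor code: we use PyTorch-style tensors for brevity, implement the parsing term as the cross-entropy of the selected head's attention rows against the gold head indices (a common differentiable surrogate for the UAS objective stated above), and the weighting factor parse_weight is likewise an assumption.

    import torch
    import torch.nn.functional as F

    def joint_loss(translation_logits, target_ids, head_attention, gold_heads,
                   parse_weight=1.0, pad_id=0):
        """translation_logits: [batch, tgt_len, vocab]; target_ids: [batch, tgt_len];
        head_attention: [batch, src_len, src_len], row u = attention of source
        token u over candidate heads (including the ROOT token at position 0);
        gold_heads: [batch, src_len], gold head position of each token (-1 = ignore)."""
        # (1) Translation: token-level cross-entropy against the reference.
        mt_loss = F.cross_entropy(
            translation_logits.reshape(-1, translation_logits.size(-1)),
            target_ids.reshape(-1),
            ignore_index=pad_id)
        # (2) Parsing: treat the selected head's attention rows as distributions
        #     over dependency heads and score them against the gold parse.
        parse_loss = F.nll_loss(
            torch.log(head_attention + 1e-9).reshape(-1, head_attention.size(-1)),
            gold_heads.reshape(-1),
            ignore_index=-1)
        return mt_loss + parse_weight * parse_loss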

It would be also possible to use e.g. the deep-syntactic parse of the sentence (the tectogrammatical layer as defined e.g. for the Prague Dependency Treebank, [11]); we leave that for future work.

5.2 Experiment Setup

Experiments in this section were carried out with T2T version 1.5.6 at the word level, i.e. without using subword units. We opted for this simplification to keep the alignment between the translation and parsing tasks simple.

The Transformer hyper-parameter set transformer_base [28] was used for all model variants, with hidden size 512, filter size 2048, 8 self-attention heads and 6 layers in each of the encoder and decoder. From now on, we refer to the Transformer model with this hyper-parameter set as "TransformerBase", our baseline. We also experimented with the choice of the encoder layer used for parsing.

In addition to the standard preprocessing for MT, we inserted a special "ROOT" word to the beginning of every sentence, so that the selected self-attention head would be able to represent a dependency tree correctly.

5.3 Layer Choice

First, we experiment with the selection of the encoder layer (one of six) from which we take the self-attention head that will serve as the dependency parse. Table 5 presents the results for both translation and parsing.

Table 5 DepParse's results in translation (BLEU) and parsing (UAS) on automatically annotated data (cs2en). All test BLEU gains, except for layer 0, are statistically significant with p < 0.01 when compared to TransformerBase 

                       BLEU              UAS
                       Dev      Test     Dev      Test
TransformerBase        37.28    36.66    -        -
Parse from layer 0     36.95    36.60    81.39    82.85
Parse from layer 1     38.51    38.01    90.17    90.78
Parse from layer 2     38.50    37.87    91.31    91.18
Parse from layer 3     38.37    37.67    91.43    91.43
Parse from layer 4     37.86    37.60    91.65    91.56
Parse from layer 5     37.63    37.67    91.44    91.46

It is apparent that layer 0 (the first layer) is too early a stage for both tasks. The self-attention mechanism there has access only to the input word embeddings, and their relations are likely to be useful semantically rather than syntactically. On the other hand, layers 1 and 2 perform well in parsing, and they are the best layers for translation quality. A possible explanation is that they already have sufficient information for a reasonably precise parse and do not consume the encoder's capacity needed for translation.

Further layers perform better and better in parsing (because they are more informed) and maintain solid performance in translation, but the translation quality slowly decreases. For the following experiments, we select layer 1 as the layer from which we demand the syntactic information.

5.4 Performance in Translation

Table 6 compares the performance of the baseline Transformer, the simple alternating setup from Section 4 (DepHeads src) and the multi-task setup from this section. All these runs use T2T version 1.5.6 and use words, not subword units. This also explains the decrease in BLEU compared to Table 2. The DepParse model significantly outperforms the baseline (38.01 vs. 36.66 and 14.27 vs. 13.96).

Table 6 BLEU scores on test set for translation task (T2T 1.5.6, word level). Statistical significance marked as † (p < 0.05) and ‡ (p < 0.01) when compared to TransformerBase 

Model de2cs cs2en
TransformerBase 13.96 36.66
Alternating multi-tasking (Section 4) 12.85 36.47
DepParse (Section 5) 14.27† 38.01‡

5.5 Performance in Parsing

In addition to the automatically annotated dev and test sets, we also evaluated our model on the gold evaluation sets from UD 2.0 for German and from PDT 2.5 for Czech. The referential parsers were the same as for Table 4 above. Table 7 shows that our model achieved good results in comparison to the referential parsers on those datasets, even though it was trained on synthetic (automatically parsed) data.

Table 7 UAS on gold annotated test sets for parsing task 

Model de2cs cs2en
Referential parsers 62.87 86.28
DepParse 76.48 82.53

5.6 Diagonal Parse

For contrast, we conduct an experiment with a simpler sentence structure, which we call the "diagonal parse". In the diagonal parse, the dependency head of a token is simply the previous token, i.e. a linear tree, as illustrated in Figure 6.

Fig. 6 Dummy dependencies with diagonal matrix (the columns represent the heads, the rows are dependents) 

Our model for joint diagonal parsing and translation ("DiagonalParse") is identical to the "DepParse" model described in Section 5.1; we only use the diagonal matrices during training instead of the dependency matrices. The main goal of this setup is to examine whether the benefits for machine translation indeed come from the additional syntactic information.
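
Generating the training signal for the diagonal parse is trivial; a minimal sketch (our own illustration, using 0-based indexing with the inserted ROOT token at position 0):

    def diagonal_heads(n_tokens):
        """Gold head indices for the diagonal parse (a linear tree): the head of
        every token is simply the preceding token, with ROOT at position 0
        heading the first real word.  The list covers positions 1..n_tokens."""
        return [i - 1 for i in range(1, n_tokens + 1)]

    # Example: a 4-word sentence (plus ROOT) yields heads [0, 1, 2, 3],
    # i.e. a strictly diagonal dependency matrix.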

Table 8 documents that the "DiagonalParse" model is very effective. The diagonal parsing precision is, as expected, very high, ranging from 99.95% to 99.99% on the test set. This joint model also outperformed the baseline in the translation task in all its variants (BLEU scores from 37.47 to 38.14, compared to 36.66).

Table 8 DiagonalParse's results in translation (BLEU) and diagonal parsing (precision) on cs2en. All test BLEU improvements are statistically significant with p < 0.01 when compared to the TransformerBase 

                       BLEU              Precision
                       Dev      Test     Dev      Test
TransformerBase        37.28    36.66    -        -
Parse from layer 0     38.68    38.14    99.97    99.96
Parse from layer 1     39.11    38.06    99.99    99.99
Parse from layer 2     37.85    37.85    99.98    99.98
Parse from layer 3     37.93    37.70    99.97    99.98
Parse from layer 4     37.68    37.47    99.98    99.96
Parse from layer 5     37.53    37.54    99.96    99.95

Moreover, these results form an observable pattern in which the best result comes from the model that parses from layer 0. Parsing from deeper layers still improves translation over the baseline, but the BLEU scores decrease. We believe a possible explanation for this pattern is that the diagonal matrix represents the relation between the preceding token and the current token. This simple sentence structure can serve as additional positional information complementing the absolute positional embeddings. Therefore, the sooner the model is forced to recognize this positional information (via training the parsing task), the better it can learn to translate. Another possible explanation is the regularizing effect of the diagonal parse.

5.7 Training Speed

As seen above, the secondary objective, both for the true dependency tree and for the diagonal parse, helps to improve translation quality in terms of BLEU score.

It is worth noting that this extension of the model, the secondary objective, comes at a rather small cost in training time compared to the baseline Transformer.

The training time (including internal evaluation every 1000 steps) on a single GPU NVIDIA GTX 1080 Ti needed to reach 250k steps for TransformerBase was about 1 day and 4 hours while our joint models needed only 10%-13% more time to train on both tasks.

5.8 Self-Attention Patterns in the Encoder

Figure 7 presents the behavior of the self-attention mechanism in each layer of our models for the first 100 sentences of the test set. We summarize the self-attention across sentences and Transformer heads of a given layer using a histogram of the observed self-attention weights. Most cells in the attention matrix indicate no attention, so the bin [0.0, 0.1) always receives the highest value in the histogram. For clarity of the picture, we exclude this bin and focus on the other observed values of attention weights; a sketch of this summary follows Figure 7.

Fig. 7 Histogram of normalized self-attention weights in the encoder 
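
The summary itself is a plain histogram. The sketch below (our own illustration with NumPy) shows how the weights of one layer could be pooled over sentences and heads, with the dominant [0.0, 0.1) bin dropped as in the figure:

    import numpy as np

    def attention_histogram(attention_matrices, n_bins=10):
        """attention_matrices: list of arrays of shape [heads, src_len, src_len],
        one per sentence, all taken from the same encoder layer.  Returns bin
        counts for [0.1, 0.2), ..., [0.9, 1.0], i.e. with the first bin excluded."""
        weights = np.concatenate([a.reshape(-1) for a in attention_matrices])
        counts, edges = np.histogram(weights, bins=n_bins, range=(0.0, 1.0))
        return counts[1:], edges[1:]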

As can be seen from Figure 7, the Transformer base encoder shows a very similar distribution of attention weights across all its six layers: most cells have low values. The distributions are very different when one of the heads at a given layer is trained to perform source-side parsing, i.e. to assign exactly one governor to each word in the input sentence. With this constraint, the attention distribution at the particular layer is peaked, with each position clearly attending to only one or two positions from the previous layer.

This behavior is apparent in all our multi-task models except "Parse from layer 0", where we see a mix of the baseline and the peaked pattern. As mentioned in Section 5.3, this model performed badly on both tasks. While the causality is unclear, we at least see that sharpness in attention is related to better performance.

Figure 8 documents another interesting observation (as above, the bin [0.0, 0.1) was excluded). One could perhaps expect such sharp attention from the one particular head which was trained to predict dependencies, but interestingly, the same sharpness is observed in all heads of the given layer. A possible reason may be the vector concatenation and layer normalization after each multi-head attention layer in the Transformer.

Fig. 8 Histogram of self-attention weights in the encoder's layer 4 when parsing from layer 4 

6 Discussion

[13] suggest another representation of "DepHeads" which does not suffer from unknown and repeated words: they indicate the governing node by its offset from the node's position, represented as a decimal number, positive to the right, negative to the left. As we documented in Section 4, the Transformer can easily learn to count words, so this representation should be considered in future work.

We set aside the question of vocabulary design for multi-tasking. In T2T's SubwordTextEncoder (STE), the vocabulary is constructed automatically from a training data sample, so that frequent words are represented as single subwords and rare words as sequences of characters. We assume that balancing the importance of the source and target sides of the particular tasks in the STE input, thus steering the vocabulary towards more efficient representations of some of these sides, could lead to better quality. This could be further combined with various parameters of the constant task scheduler.

Multiple multi-task experiments ([23], [42], etc.) mention notable gains in small data scenarios. As documented by [17], below a certain training data size, NMT is actually much worse than conventional phrase-based MT. It is unclear whether the gains from NMT multi-tasking are also obtained beyond this critical corpus size, or whether they are limited to the data sizes where NMT is not so effective.

The observation in Section 5.6 that a very similar BLEU gain can be achieved using either true syntactic trees or the dummy diagonal parse casts doubt on the utility of explicit linguistic information for Transformer models. While we keep the stance that linguistic generalization will be useful for translation quality in the long term, our results cannot confirm this yet. We are nevertheless happy that we included this dummy experiment and raised this concern, rather than confidently claiming that source syntax helps the Transformer. We share the opinion of [18] that dummy baselines are critically needed for trustworthy progress in NMT.

One limitation of our setup was that our model was trained on automatic parses. Hence, it would be interesting to fine-tune our model with gold-annotated trees, which could lead to a better parsing performance. We leave this for future work.

7 Conclusion

We proposed two techniques of promoting the knowledge of source syntax in the Transformer model of NMT by multi-tasking and evaluated them at reasonably large data sizes.

The simple data manipulation technique, alternating translation and linearized parsing, is impractical. Learning to translate and parse improves over comparable multi-task setups with uninformative ("dummy") secondary tasks, but overall it performs worse than the single-task translation model. In low-resource conditions, the gain from multi-tasking may still be useful.

The other technique, re-interpreting one of the self-attention heads in the Transformer model as the dependency analysis of the sentence, is surprisingly effective. At little or no cost in training time, the Transformer learns to translate and parse at the same time. The parse accuracy is reasonable and the translation is significantly better than the baseline. Curiously, very similar gains can be obtained by predicting a "diagonal parse", i.e. a linguistically uninformed linear tree. The full explanation of this behavior is yet to be found.

Acknowledgments

The research was partially supported by the grants 19-26934X (NEUREM3) of the Czech Science Foundation and H2020-ICT-2018-2-825460 (ELITR) of the EU.

References

1. Aharoni, R. & Goldberg, Y. (2017). Towards string-to-tree neural machine translation. Proceedings of the 55th ACL Meeting, Volume 2: Short Papers, pp. 132-140. [ Links ]

2. Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. Proceedings of ICLR. [ Links ]

3. Barrault, L., Bojar, O., Costa-jussa, M. R., Federmann, C., Fishel, M., Graham, Y., Haddow, B., Huck, M., Koehn, P., Malmasi, S., Monz, C., Muller, M., Pal, S., Post, M., & Zampieri, M. (2019). Findings of the 2019 Conference on Machine Translation (WMT19). Proceedings of the Fourth Conference on Machine Translation, Association for Computational Linguistics, Florence, Italy. [ Links ]

4. Bojar, O., Dušek, O., Kocmi, T., Libovický, J., Novák, M., Popel, M., Sudarikov, R., & Variš, D. (2016). CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered. Sojka, P., Horák, A., Kopeček, I., & Pala, K., editors, Text, Speech, and Dialogue: 19th International Conference, TSD 2016, number 9924 in Lecture Notes in Computer Science, pp. 231-238. [ Links ]

5. Bojar, O., Federmann, C., Fishel, M., Graham, Y., Haddow, B., Huck, M., Koehn, P., & Monz, C. (2018). Findings of the 2018 Conference on Machine Translation (WMT18). Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, Association for Computational Linguistics, Belgium, Brussels, pp. 272-307. [ Links ]

6. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724-1734. [ Links ]

7. Dozat, T. & Manning, C. D. (2016). Deep biaffine attention for neural dependency parsing. CoRR, Vol. abs/1611.01734. [ Links ]

8. Dyer, C., Kuncoro, A., Ballesteros, M., & Smith, N. A. (2016). Recurrent neural network grammars. HLT-NAACL, pp. 199-209. [ Links ]

9. Eriguchi, A., Tsuruoka, Y., & Cho, K. (2017). Learning to parse and translate improves neural machine translation. Proceedings of the 55th ACL Meeting, Volume 2: Short Papers, pp. 72-78. [ Links ]

10. Ha, T.-L., Niehues, J., & Waibel, A. (2016). Toward Multilingual Neural Machine Translation with Universal Encoder and Decoder. Proceedings of the International Workshop on Spoken Language Translation, IWSLT'16, Seattle, USA. [ Links ]

11. Hajič, J., Panevová, J., Hajičová, E., Sgall, P., Pajas, P., Štĕpánek, J., Havelka, J., Mikulová, M., Žabokrtský, Z., & Ševčiková Razímová, M. (2006). Prague Dependency Treebank 2.0. [ Links ]

12. Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viegas, F. B., Wattenberg, M., Corrado, G., Hughes, M., & Dean, J. (2016). Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR, Vol. abs/1611.04558. [ Links ]

13. Kiperwasser, E. & Ballesteros, M. (2018). Scheduled Multi-Task Learning: From Syntax to Translation. Transactions of the Association for Computational Linguistics, Vol. 6, pp. 225-240. [ Links ]

14. Klejch, O., Avramidis, E., Burchardt, A., & Popel, M. (2015). MT-ComparEval: Graphical evaluation interface for Machine Translation development. The Prague Bulletin of Mathematical Linguistics, Vol. 104, No. 1, pp. 63-74. [ Links ]

15. Koehn, P. (2004). Statistical significance tests for machine translation evaluation. Proceedings of EMNLP, volume 4, pp. 388-395. [ Links ]

16. Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. Conference Proceedings: the tenth Machine Translation Summit, AAMT, AAMT, Phuket, Thailand, pp. 79-86. [ Links ]

17. Koehn, P. & Knowles, R. (2017). Six challenges for neural machine translation. Proceedings of the First Workshop on Neural Machine Translation, Association for Computational Linguistics, Vancouver, pp. 28-39. [ Links ]

18. Kondratyuk, D., Cardenas, R., & Bojar, O. (2019). Replacing Linguists with Dummies: A Serious Need for Trivial Baselines in Multi-Task Neural Machine Translation. The Prague Bulletin of Mathematical Linguistics, Vol. 113, No. 1. [ Links ]

19. Le, A. N., Martinez, A., Yoshimoto, A., & Matsumoto, Y. (2017). Improving sequence to sequence neural machine translation by utilizing syntactic dependency information. Proceedings of the Eighth International Joint Conference on Natural Language Processing, IJCNLP, Volume 1: Long Papers, pp. 21-29. [ Links ]

20. Luong, M., Le, Q. V., Sutskever, I., Vinyals, O., & Kaiser, L. (2015). Multi-task sequence to sequence learning. CoRR, Vol. abs/1511.06114. [ Links ]

21. Macháček, D., Vidra, J., & Bojar, O. (2018). Morphological and Language-Agnostic Word Segmentation for NMT. Text, Speech, and Dialogue: TSD 2018, number 11107 in Lecture Notes in Artificial Intelligence, Masaryk University, Cham / Heidelberg / New York / Dordrecht / London, pp. 277-284. [ Links ]

22. Nadejde, M., Reddy, S., Sennrich, R., Dwojak, T., Junczys-Dowmunt, M., Koehn, P., & Birch, A. (2017). Predicting target language CCG supertags improves neural machine translation. Proceedings of the Second Conference on Machine Translation, Volume 1: Research Papers, pp. 68-79. [ Links ]

23. Niehues, J. & Cho, E. (2017). Exploiting linguistic resources for neural machine translation using multitask learning. WMT. [ Links ]

24. Nivre, J., Agic, Ž., Ahrenberg, L., Antonsen, L., & et al. (2017). Universal dependencies 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (UFAL), Faculty of Mathematics and Physics, Charles University. [ Links ]

25. Nivre, J., Hall, J., Kübler, S., McDonald, R. T., Nilsson, J., Riedel, S., & Yuret, D. (2007). The CoNLL 2007 shared task on dependency parsing. EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, pp. 915-932. [ Links ]

26. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 311-318. [ Links ]

27. Popel, M. (2018). CUNI Transformer Neural MT System for WMT18. Proceedings of the Third Conference on Machine Translation, Association for Computational Linguistics, Belgium, Brussels, pp. 486-491. [ Links ]

28. Popel, M. & Bojar, O. (2018). Training tips for the transformer model. The Prague Bulletin of Mathematical Linguistics , Vol. 110, No. 1, pp. 43-70. [ Links ]

29. Popel, M. & Žabokrtský, Z. (2010). TectoMT: Modular NLP framework. Loftsson, H., Rögnvaldsson, E., & Helgadottir, S., editors, Proceedings of the 7th International Conference on Advances in Natural Language Processing (IceTAL 2010), volume 6233 of Lecture Notes in Computer Science, Springer, pp. 293-304. [ Links ]

30. Popović, M. (2015). chrF: Character n-gram F-score for automatic MT evaluation. Proceedings of the Tenth Workshop on Statistical Machine Translation, WMT@EMNLP 2015, Lisbon, Portugal, pp. 392-395. [ Links ]

31. Shi, X., Padhi, I., & Knight, K. (2016). Does string-based neural MT learn source syntax? Su, J., Carreras, X., & Duh, K., editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP, 2016, The Association for Computational Linguistics, pp. 1526-1534. [ Links ]

32. Stanojević, M. & Sima'an, K. (2014). Fitting sentence level translation evaluation with many dense features. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, pp. 202-206. [ Links ]

33. Steedman, M. (2000). The Syntactic Process. MIT Press, Cambridge, MA, USA. [ Links ]

34. Straka, M. & Straková, J. (2017). Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Vancouver, Canada, pp. 88-99. [ Links ]

35. Strubell, E., Verga, P., Andor, D., Weiss, D., & McCallum, A. (2018). Linguistically-informed self-attention for semantic role labeling. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, pp. 5027-5038. [ Links ]

36. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. NIPS, pp. 3104-3112. [ Links ]

37. Tamchyna, A., Weller-Di Marco, M., & Fraser, A. (2017). Modeling target-side inflection in neural machine translation. Proceedings of the Second Conference on Machine Translation, Volume 1: Research Paper, pp. 32-42. [ Links ]

38. Tiedemann, J. (2009). News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In Proc. of RANLP, volume V. Borovets, Bulgaria, pp. 237-248. [ Links ]

39. Tran, K. M., Bisazza, A., & Monz, C. (2018). The importance of being recurrent for modeling hierarchical structure. CoRR, Vol. abs/1803.03585. [ Links ]

40. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6000-6010. [ Links ]

41. Wang, W., Peter, J.-T., Rosendahl, H., & Ney, H. (2016). Character: Translation edit rate on character level. Proceedings of the First Conference on Machine Translation, pp. 505-510. [ Links ]

42. Zaremoodi, P. & Haffari, G. (2018). Neural machine translation for bilingually scarce scenarios: A deep multi-task learning approach. [ Links ]

43. Zoph, B. & Knight, K. (2016). Multi-Source Neural Translation. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, pp. 30-34. [ Links ]

44. Zoph, B., Yuret, D., May, J., & Knight, K. (2016). Transfer learning for low-resource neural machine translation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1568-1575. [ Links ]

1[13] do not explicitly state whether they use the source or the target language treebank as the training data for the parsing task. While both are actually possible, and even a combination of the two could be tried, we assume they used the source-side treebank only.

5If one word form appears multiple times in a sentence, we attach the edge to the nearest option. We propose this approach mostly for annotation schemes, in which content words (in contrast to function words) appear as inner nodes of dependency trees. Since content words are usually not repeated in sentences, there is a low chance they will be mismatched.

6We adopt the terminology of [28].

Received: February 23, 2019; Accepted: March 04, 2019

* Corresponding author is Ondřej Bojar. bojar@ufal.mff.cuni.cz

This is an open-access article distributed under the terms of the Creative Commons Attribution License.