1 Introduction
Dependency parsing approaches fall into two families: graph-based and transition-based parsers (Kübler et al., 2009). Given a sentence s, a graph-based algorithm finds the highest-scoring parse tree among all possible outputs, while a transition-based algorithm builds a parse through a sequence of actions. In recent years, many researchers have developed deep learning approaches that achieve high accuracy for English, Chinese and other languages. Chen and Manning proposed a novel way of learning a neural network classifier for a greedy, transition-based dependency parser, which achieved UAS=92.2% and LAS=89.7% on the English Penn Treebank [1].
Dyer et al. (2015) [3] presented stack LSTMs, recurrent neural networks for sequences with push and pop operations, and used them to implement a state-of-the-art transition-based dependency parser with UAS=93.2% and LAS=90.9% for English. Kiperwasser et al. (2016) [5] presented a simple and effective scheme for dependency parsing based on bidirectional LSTMs (BiLSTMs), which reached UAS=93.8% and LAS=91.5% for English. In addition, Dozat and Manning (2016) [2] built on the work of Kiperwasser et al., using neural attention in a simple graph-based dependency parser. Their parser achieved state-of-the-art or near state-of-the-art performance on standard treebanks for six different languages, with 95.7% UAS and 94.1% LAS on the most popular English PTB dataset.
There have also been many contributions to Vietnamese dependency parsing. In 2008, Nguyễn Lê Minh et al. [12] used the MST parser on a corpus of 450 sentences. Then, in 2012, Phuong Le et al. [6] applied a lexicalized tree-adjoining grammar parser trained on a subset of the Vietnamese treebank. In 2013, Thi-Luong et al. [18] used MaltParser on a Vietnamese dependency treebank converted automatically from a Vietnamese constituency treebank. One year later, Dat et al. [14] presented a new conversion method to automatically transform a constituent-based Vietnamese treebank into dependency trees.
In 2015, Phuong Le et al. [8] improved the accuracy of Vietnamese dependency parsing by using distributed word representations from the Skip-gram and GloVe models in transition-based dependency parsing. In 2016, Thi-Luong et al. [16] used distributed word representations from Skip-gram in graph-based dependency parsing for Vietnamese, and Dat et al. [13] presented an empirical study of Vietnamese dependency parsing. In 2017, Kiem Hieu [15] presented work on building BKTreebank, a dependency treebank for Vietnamese.
1.1 Transition-Based Dependency Parsing
A transition system has a set of configurations and a set of transitions that are applied to configurations. When parsing a sentence, the system is initialized to an initial configuration based on the input sentence, and transitions are repeatedly applied to this configuration. After a finite number of transitions, the system arrives at a terminal configuration, and a parse tree is read off the terminal configuration. In a greedy parser, a classifier is used to choose the transition to take in each configuration, based on features extracted from the configuration itself. The parsing algorithm is presented in Algorithm 1 below.
Several transition-based systems [7] are popular, such as the arc-eager and arc-standard algorithms. In this work, however, we employ the arc-hybrid system, which is similar to these. In the arc-hybrid system, a configuration c = (α, β, T) consists of a stack α, a buffer β, and a set T of dependency arcs.
Both the stack and the buffer hold integer indices pointing to sentence elements. Given a sentence s = w1, w2,..., wn, the system is initialized with an empty stack, an empty arc set, and β = 1,..., n, ROOT, where ROOT is the special root index. Any configuration c with an empty stack and a buffer containing only ROOT is terminal, and the parse tree is given by the arc set Tc of c. The arc-hybrid system allows 3 possible transitions, SHIFT, LEFT and RIGHT, defined as:
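The three transitions can be illustrated with a minimal Python sketch. It assumes the standard arc-hybrid definitions: SHIFT moves the first buffer item onto the stack, LEFT attaches the top of the stack to the first buffer item, and RIGHT attaches the top of the stack to the item below it. Dependency labels are omitted for brevity. The names `Configuration`, `initial`, `apply_all` and the use of 0 as the ROOT index are illustrative assumptions, not part of any parser described here.

```python
from dataclasses import dataclass, field

ROOT = 0  # assumption: index 0 stands for the special ROOT element


@dataclass
class Configuration:
    stack: list = field(default_factory=list)   # α: indices already shifted
    buffer: list = field(default_factory=list)  # β: indices still to be read
    arcs: set = field(default_factory=set)      # T: (head, dependent) pairs


def initial(n):
    """Empty stack, empty arc set, buffer = 1..n followed by ROOT."""
    return Configuration(buffer=list(range(1, n + 1)) + [ROOT])


def is_terminal(c):
    """Terminal: empty stack and a buffer containing only ROOT."""
    return not c.stack and c.buffer == [ROOT]


def shift(c):
    # Move the first buffer item onto the stack (ROOT is never shifted).
    if not c.buffer or c.buffer[0] == ROOT:
        raise ValueError("SHIFT needs a non-ROOT item at the buffer front")
    c.stack.append(c.buffer.pop(0))


def left(c):
    # Make the first buffer item the head of the stack top, then pop it.
    if not c.stack or not c.buffer:
        raise ValueError("LEFT needs a stack item and a buffer item")
    s0 = c.stack.pop()
    c.arcs.add((c.buffer[0], s0))


def right(c):
    # Make the second stack item the head of the stack top, then pop it.
    if len(c.stack) < 2:
        raise ValueError("RIGHT needs two items on the stack")
    s0 = c.stack.pop()
    c.arcs.add((c.stack[-1], s0))


TRANSITIONS = {"SHIFT": shift, "LEFT": left, "RIGHT": right}


def apply_all(n, names):
    """Apply a sequence of transition names to the initial configuration."""
    c = initial(n)
    for name in names:
        TRANSITIONS[name](c)
    return c
```

For a three-word sentence whose second word is the root and governs the other two, the sequence SHIFT, LEFT, SHIFT, SHIFT, RIGHT, LEFT reaches a terminal configuration with the arcs (2, 1), (2, 3) and (ROOT, 2).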
1.2 Graph-Based Dependency Parsing
The second approach is the graph-based dependency parsing algorithm introduced by McDonald et al. [11]. In this algorithm, the weight of each edge of the dependency graph of a sentence is calculated as follows:
where w is the weight of the (i, j) edge and f(i, j) is the feature representation of the (i, j) edge. The weight of the (i, j) edge represents how plausible a dependency is between the head (wi) and the dependent (wj). If the arc score function is known, then the weight of the graph is:
Then, based on the weights of all edges in the graph, McDonald et al. [10] showed that this problem is equivalent to finding the highest-scoring directed spanning tree of the graph G rooted at the root node 0.
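The arc-factored idea can be made concrete with a small Python sketch: the score of a tree is the sum of the scores of its arcs, and parsing is a search for the highest-scoring tree rooted at node 0. The search below is an exhaustive enumeration, which is exponential and only usable for very short sentences. It stands in for an efficient maximum spanning tree algorithm such as Chu-Liu/Edmonds and is not the method of McDonald et al. The function names and the example score matrix are illustrative assumptions, and the sketch does not restrict the number of words attached to the root.

```python
from itertools import product


def tree_score(heads, arc_score):
    """Sum of arc scores; heads[m - 1] is the head of word m (0 = root)."""
    return sum(arc_score[h][m] for m, h in enumerate(heads, start=1))


def is_tree(heads):
    """True if every word reaches the root node 0 without a cycle."""
    for m in range(1, len(heads) + 1):
        seen = set()
        node = m
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def best_tree(arc_score):
    """Exhaustive argmax over all head assignments that form a tree."""
    n = len(arc_score) - 1
    candidates = (h for h in product(range(n + 1), repeat=n) if is_tree(h))
    return max(candidates, key=lambda h: tree_score(h, arc_score))


# Made-up scores: example_scores[h][m] is the score of the arc h -> m.
example_scores = [
    [0, 1, 9, 2],
    [0, 0, 3, 1],
    [0, 8, 0, 7],
    [0, 2, 4, 0],
]
```

With these made-up scores, `best_tree(example_scores)` selects word 2 as the root and attaches words 1 and 3 to it, for a total score of 24.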
1.3 Long Short-Term Memory
Recurrent Neural Network. The recurrent neural network (RNN) is a class of artificial neural networks designed for sequence labeling tasks. It takes a sequence of vectors as input and returns another sequence. The simple RNN architecture has an input layer x, a hidden layer h and an output layer y. At each time step t, the values of each layer are computed as follows:
where Wih, Whh and Who are the three connection weight matrices, and fh and fo are the hidden and output unit activation functions, namely the sigmoid and the softmax, respectively.
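A standard formulation of these updates, written with the symbols defined above, is given below. Bias terms are omitted, which is an assumption on our part.

```latex
\begin{aligned}
h_t &= f_h\left(W_{ih}\, x_t + W_{hh}\, h_{t-1}\right) \\
y_t &= f_o\left(W_{ho}\, h_t\right)
\end{aligned}
```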
Long Short-Term Memory. Long Short-Term Memory (LSTM) was first proposed in 1997 by Hochreiter and Schmidhuber [4]. LSTM is an extension of the RNN designed to combat the vanishing and exploding gradient problems that arise when learning from long-range sequences. LSTM networks are the same as RNNs, except that the hidden layer updates are replaced by memory cells. Figure 1 shows an LSTM cell, in which i, f and o are the input, forget and output gates, respectively, and c is the cell vector.
where σ is the element-wise sigmoid function and ⊙ is the element-wise product, i, f, o and c are the input gate, forget gate, output gate, and cell vector respectively. Ui, Uf, Uc, Uo are connection weight matrices between input x and gates, and Wi, Wf, Wc, Wo are connection weight matrices between gates and hidden state h.
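With the symbols just defined, a commonly used form of the LSTM update is sketched below. The bias vectors b and the candidate cell vector \tilde{c}_t are assumptions introduced for completeness, and variants of the LSTM (for example with peephole connections) differ in detail.

```latex
\begin{aligned}
i_t &= \sigma\left(U_i x_t + W_i h_{t-1} + b_i\right) \\
f_t &= \sigma\left(U_f x_t + W_f h_{t-1} + b_f\right) \\
o_t &= \sigma\left(U_o x_t + W_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(U_c x_t + W_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```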
Bidirectional Long Short-Term Memory. The original LSTM uses only the previous context for prediction. For many sequence labeling tasks, it is beneficial to take context from both directions. A bidirectional LSTM uses both the previous and the future context by processing the sequence in two directions, and generates two independent sequences of LSTM output vectors.
2 Approach
2.1 Universal Dependency Parsing in Vietnamese
2.1.1 Universal Dependency
A dependency label represents the relation between two words in a sentence. Different pairs of words, in different positions, may bear different dependency labels. Labels are assigned according to general guidelines that are applied uniformly throughout the language. Many sets of relation labels have been proposed for a given language, and they differ from one another.
Universal Dependencies (UD) was developed by the Stanford University team, Marneffe et al. [9]. The project builds on treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. It is based on the Stanford Dependencies (SD) labels, also from the Stanford University team (Marneffe et al., 2015), on the multilingual part-of-speech tags of Petrov et al. (2012), and on the Interset morphosyntactic tagset of Zeman (2008).
The general objective of Universal Dependencies is to provide a label set and guidelines that facilitate the construction of similar resources for other languages and allow extension to new languages. The labels in SD are organized in groups of subject, object, clauses, word definitions, or nouns. SD defines nearly 50 types of English dependencies based on the Penn Treebank corpus. All of these dependencies are binary relations between a head word and its dependent word. Each relation is given by three components: dependency label, head word and dependent word.
Universal Dependencies can be applied to many different languages and can be used to suggest improvements in dependency parsing, even for English. The research team has developed a core label set that has been extensively tested on a variety of languages, so this core set can be applied across many languages. It is also possible to add new labels as needed, either for special linguistic relations or for individual cases in one or more groups of languages. The label set can therefore cover many different languages, such as English, French, German and Chinese. A shared label set is useful because it can express the same dependency for the same sentence in different languages.
Universal Dependencies contains 40 labels, organized according to the principles of the UD taxonomy: the rows correspond to functional categories in relation to the head (core arguments of clausal predicates, non-core dependents of clausal predicates, and dependents of nominals), while the columns correspond to structural categories of the dependent (nominals, clauses, modifier words, function words), as in Table 1. All Universal Dependencies labels are defined with specific examples, which can be used to develop and build a complete label set for other languages.
 | Nominals | Clauses | Modifier words | Function words |
---|---|---|---|---|
Core arguments | nsubj | csubj | | |
 | obj | ccomp | | |
 | iobj | xcomp | | |
Non-core arguments | nsubj | csubj | | |
 | obl | advcl | advmod | aux |
 | vocative | | discourse | |
 | expl | | | |
 | dislocated | | | |

Coordination | MWE | Loose | Special | Other |
---|---|---|---|---|
conj | fixed | list | orphan | punct |
cc | flat | parataxis | goeswith | root |
 | compound | | reparandum | dep |
2.1.2 Vietnamese Dependencies
Based on Universal Dependencies and the Viettreebank, we have built a set of Vietnamese dependencies. This set contains labels that coincide with those in UD, together with several new labels, for a total of 46 labels. Some of the dependency labels that we designed specifically for Vietnamese are:
- csubj:asubj (adjective subject): An adjective subject is an adjective phrase that is the syntactic subject of a clause. In Vietnamese, the subject is usually a noun (or a noun phrase), but in some cases an adjective is the subject.
- csubj:vsubj (verb subject): This label describes the phenomenon in which a verb is the subject of a sentence. In Vietnamese, the subject is usually a noun, but in some cases an adjective, a verb or a clause can act as the subject of a sentence.
- nc (classifier noun): This relation holds between a classifier noun and a common noun. The classifier noun always stands before the common noun, for example “cái”, “con”.
- vnom (verb nominal): This relation holds between a nominalized verb and a classifier noun. The classifier noun always stands before the verb, for example “cái”, “sự”, “việc”.
Tables 2 and 3 compare the two sets of labels.
VD (2016) | UD (2015) | Meaning |
---|---|---|
csubj | csubj | Clausal subject |
csubj:asubj | ||
csubj:vsubj | ||
acomp | xcomp | Adjectival complement |
amod | amod | Adjectival modifier |
apredmod | advmod | Adjectival modifier of a predicate |
advmod | advmod | Adverbial modifier |
advcl | advcl | Adverbial clause modifier |
aux | aux | Auxiliary |
auxpass | auxpass | Passive auxiliary |
appos | appos | Appositional modifier |
cc | cc | Coordination |
ccomp | ccomp | Clausal complement |
conj | conj | Conjunct |
cop | cop | Copula |
dep | dep | Dependent |
det | det | Determiner |
discourse | discourse | Discourse element |
dislocated | dislocated | Dislocated elements |
dobj | dobj | Direct object |
foreign | foreign | Foreign words |
iobj | iobj | Indirect object |
list | list | List |
mark | mark | Marker |
neg | neg | Negation modifier |
VD (2016) | UD (2015) | Meaning |
---|---|---|
nn | compound | Noun compound modifier |
nsubj | nsubj | Nominal subject |
num | nummod | Numeric modifier |
number | compound | Element of compound number |
parataxis | parataxis | Parataxis |
pcomp | mark | Prepositional complement |
pobj | case | Object of a preposition |
prep | nmod | Prepositional modifier |
punct | punct | Punctuation |
remnant | remnant | Remnant in ellipsis |
reparandum | reparandum | Overridden disfluency |
rcmod | acl:relcl | Relative clause modifier |
ref | ref | Referent |
root | root | root |
tmod | nmod:tmod | Temporal modifier |
vcomp | ccomp | Verb complement of a verb |
vmod | amod:vmod | Verb modifier of an NP |
vocative | vocative | Vocative |
xcomp | xcomp | Open clausal complement |
nsubjpass | nsubjpass | Passive nominal subject |
csubjpass | csubjpass | Clausal passive subject |
- | expl | Expletive |
- | goeswith | Goes with |
nc | - | Classifier noun |
vnom | - | Verb nominal |
2.2 BiLSTM in Dependency Parsing
2.2.1 Using BiLSTM Feature Representation
Instead of using feature vectors directly in dependency parsing, we use the method of [5]: each word is represented by its BiLSTM encoding, and a concatenation of a minimal set of such BiLSTM encodings serves as the feature function, which is then passed to a non-linear scoring function (a multi-layer perceptron).
Given an input sentence s with n words w1,..., wn and the corresponding POS tags p1,..., pn, we associate each word wi and POS tag pi with embedding vectors e(wi) and e(pi), and denote by x1:n the sequence of input vectors, where each xi = e(wi) ◦ e(pi) is the concatenation of the word and POS embeddings.
The embeddings are trained together with the model. We also denote by vi the output of the BiLSTM at position i, computed as vi = BiLSTM(x1:n, i).
A bidirectional LSTM is composed of two LSTMs, LSTMf and LSTMb. LSTMf reads the sequence in its regular order and LSTMb reads it in reverse. Concretely, given a sequence of vectors x1:n and an index i, the function BiLSTMθ(x1:n, i) is defined as the concatenation of the two readings at position i: BiLSTMθ(x1:n, i) = LSTMf(x1:i) ◦ LSTMb(xn:i).
The feature function φ is then the concatenation of a small number of BiLSTM vectors. The resulting feature vectors are then scored using a non-linear function, namely a multi-layer perceptron with one hidden layer (MLP):
where θ = W2, W1, b2, b1 are the model parameters.
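Assuming the formulation of [5], on which this section is based, the one-hidden-layer MLP can be written as below. The choice of tanh as the hidden activation is an assumption taken from that formulation and is not stated in the text above.

```latex
\mathrm{MLP}_{\theta}(x) = W_2 \cdot \tanh\left(W_1 \cdot x + b_1\right) + b_2
```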
2.2.2 Transition-Based Dependency Parsing Using BiLSTM Feature Representation
Given a sentence s, the transition-based parser is initialized with a configuration c. A feature function φ(c) then represents the configuration c as a vector. The feature function is the concatenation of the BiLSTM vectors of some items on the stack and the buffer. For example, for a configuration c = (...|s2|s1|s0, b0|..., T) the feature extractor uses the top 3 items on the stack and the first item on the buffer, that is, φ(c) = vs2 ◦ vs1 ◦ vs0 ◦ vb0.
Each transition is scored using an MLP that is fed the BiLSTM encodings selected by the feature extractor. Each xi is the concatenation of a word vector and a POS vector. The function SCORE assigns scores to (configuration, transition) pairs. It scores each possible transition t ∈ {Shift, Left_Arc, Right_Arc}, and the highest scoring transition is chosen.
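Following [5], the scoring and the greedy choice can be summarized as below. The notation LEGAL(c) for the set of transitions applicable in configuration c and the indexing of the MLP output by transition are assumptions made for this summary.

```latex
\begin{aligned}
\mathrm{SCORE}_{\theta}\left(\phi(c), t\right) &= \mathrm{MLP}_{\theta}\left(\phi(c)\right)[t] \\
\hat{t} &= \operatorname*{arg\,max}_{t \in \mathrm{LEGAL}(c)} \mathrm{SCORE}_{\theta}\left(\phi(c), t\right)
\end{aligned}
```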
2.2.3 Graph-Based Dependency Parsing Using BiLSTM Feature Representation
In graph-based parsing, the edge weights used to build the dependency graph of a sentence s = x1:n are calculated as follows:
where Y(s) is the space of valid dependency trees over s.
Arc-factored parsing decomposes the score of a tree to the sum of the score of its head-modifier arcs (h, m):
where φ(s, h, m) is the feature extractor which uses the BiLSTM encoding of the head word and the modifier word: φ(s, h, m) = BiLSTM(x1:n, h) ◦ BiLSTM(x1:n, m).
The final model is:
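Assuming the arc-factored formulation of [5], the steps of this subsection can be collected as below. The function names parse and score are assumptions made for this summary, and v_i denotes the BiLSTM encoding of position i.

```latex
\begin{aligned}
\mathrm{parse}(s) &= \operatorname*{arg\,max}_{y \in \mathcal{Y}(s)} \sum_{(h,m) \in y} \mathrm{score}\left(\phi(s, h, m)\right) \\
&= \operatorname*{arg\,max}_{y \in \mathcal{Y}(s)} \sum_{(h,m) \in y} \mathrm{MLP}_{\theta}\left(v_h \circ v_m\right),
\qquad v_i = \mathrm{BiLSTM}_{\theta}(x_{1:n}, i)
\end{aligned}
```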
3 Experiments
3.1 Datasets
We use datasets similar to those in our research [8, 16, 18]. Text corpus for distributed word representations: To create distributed word representations, we use a dataset consisting of 7.3 GB of text from 2 million articles collected from a Vietnamese news portal. The text is first normalized to lower case. All special characters are removed except these common symbols: the comma, the semi-colon, the colon, the full stop and the percentage sign. All numeral sequences are replaced with the special token <number>, so that correlations between a certain word and a number are correctly recognized by the neural network or the log-bilinear regression model.
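The normalization steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script used to build the corpus: the function name `normalize` is hypothetical, a numeral sequence is assumed to be a run of digits that may contain internal dots or commas (so "3,5" becomes one token), and removed characters are replaced by a space so that neighbouring words are not glued together.

```python
import re
import unicodedata

NUMBER_TOKEN = "<number>"


def normalize(text):
    """Lower-case, drop special characters, and mask numeral sequences."""
    # Compose characters first so that accented letters count as letters.
    text = unicodedata.normalize("NFC", text).lower()
    # Keep letters, digits, whitespace and the five permitted symbols.
    text = re.sub(r"[^\w\s,;:.%]", " ", text)
    # Assumption: digits with internal "." or "," form one numeral sequence.
    text = re.sub(r"\d+(?:[.,]\d+)*", NUMBER_TOKEN, text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```

Numerals are masked after the special characters are removed, so the angle brackets of the `<number>` token are not stripped.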
A word in Vietnamese may consist of more than one syllable separated by spaces, which unsupervised models could mistake for multiple words. Hence it is necessary to replace the spaces within each word with underscores to create full word tokens. The tokenization process follows the method described in [17]. After removal of special characters and tokenization, the articles add up to 969 million word tokens, spanning a vocabulary of 1.5 million unique tokens. We train the unsupervised models with the full vocabulary to obtain the representation vectors, and then prune the collection of word vectors to the 5,000 most frequent words, excluding special symbols and the token <number> representing numeral sequences.
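The underscore joining and the vocabulary pruning can be sketched as follows. The sketch assumes that word segmentation has already been done by the method of [17] and is given as lists of syllables. The names `join_syllables`, `prune_vocabulary` and `EXCLUDED` are hypothetical, and word vectors are represented as a plain dictionary for illustration.

```python
from collections import Counter

# Special symbols and the numeral token are excluded from the pruned set.
EXCLUDED = {",", ";", ":", ".", "%", "<number>"}


def join_syllables(words):
    """Join the syllables of each segmented word with underscores."""
    return ["_".join(syllables) for syllables in words]


def prune_vocabulary(tokens, vectors, k=5000):
    """Keep the vectors of the k most frequent non-excluded tokens."""
    counts = Counter(t for t in tokens if t not in EXCLUDED)
    most_frequent = [word for word, _ in counts.most_common(k)]
    return {word: vectors[word] for word in most_frequent if word in vectors}
```

Frequencies are counted on the tokenized corpus, so a multi-syllable word such as "học_sinh" is counted as a single token.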
Dependency treebank. We conduct our experiments on the Vietnamese dependency treebank dataset. This treebank is derived automatically from the constituency-based annotation of the VTB [18] and contains 10,471 sentences (225,085 tokens). We manually check the correctness of the conversion on a subset of the converted corpus to obtain 3,000 sentences annotated with universal dependencies, split into a training set of 2,200 sentences, a test set of 400 sentences and a dev set of 400 sentences.
3.2 Feature Sets
Feature sets in transition-based parsing: For each parser configuration c = (...|s2|s1|s0, b0|..., T), φ(c) is the feature vector representation of c, and f(c) is the corresponding transition in the gold parse. We denote the part-of-speech tag of a token w by p(w). We use tk(w) and e(w) to denote the word form and the distributed representation of the word of token w, respectively, and rm(w) and lm(w) to denote the right-most and left-most modifiers of w. We use the feature templates in Table 4 for the classifier. Each feature vtk(w) = p(w) ◦ tk(w) or ve(w) = p(w) ◦ e(w) is a feature template of token w.
Feature set | Feature templates |
---|---|
φ0 | vtk(s0), vtk(s1), vtk(s2), vtk(b0) |
φ1 | ve(s0), ve(s1), ve(s2), ve(b0) |
φ2 | φ0, vtk(rm(s0)), vtk(lm(s0)), vtk(rm(s1)), vtk(lm(s1)), vtk(rm(s2)), vtk(lm(s2)), vtk(lm(b0)) |
φ3 | φ1, ve(rm(s0)), ve(lm(s0)), ve(rm(s1)), ve(lm(s1)), ve(rm(s2)), ve(lm(s2)), ve(lm(b0)) |
Feature sets in graph-based parsing: The feature set proposed by McDonald et al. (2005) has 18 templates for a first-order parser, while the first-order feature extractor in the actual implementation (MSTParser) includes roughly a hundred feature templates. In our case, the feature extractor uses merely the encodings of the head word and the modifier word, each built from the POS tag and the word.
3.3 Vietnamese Dependency Parsing Based on Bist-Parser
The Bist-parser is a tool that uses BiLSTM feature extractors with graph-based and transition-based dependency parsers. It was developed by Kiperwasser et al. and uses the BiLSTM feature extractors described in Section 2.2.
We use two attachment scores, the labeled attachment score (LAS) and the unlabeled attachment score (UAS), to evaluate the accuracy of the dependency parsing system. Attachment scores are defined as the percentage of correct dependency relations recovered by the parser. A dependency relation is considered correct if both the source word and the target word are correct (UAS) and, in addition, the dependency type is correct (LAS).
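The two scores can be computed as in the following minimal sketch. It assumes that the gold and predicted analyses are given as one (head index, label) pair per token and that every token is counted. Whether punctuation is excluded is not specified here, and the function name `attachment_scores` is hypothetical.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent for aligned (head, label) sequences."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and aligned")
    correct_heads = 0
    correct_labeled = 0
    for (gold_head, gold_label), (pred_head, pred_label) in zip(gold, predicted):
        if gold_head == pred_head:
            correct_heads += 1           # counts towards UAS
            if gold_label == pred_label:
                correct_labeled += 1     # counts towards LAS as well
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_labeled / total
```

By construction LAS can never exceed UAS, since a labeled match requires the head to be correct first.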
We also evaluate on the Vietnamese dependency treebank [18]. The result is the highest accuracy reported for Vietnamese dependency parsing, as presented in Table 6.
Feature set | System | UAS (test) | LAS (test) |
---|---|---|---|
φ2 | Transition-based | 82.77% | 76.02% |
φ2 | Graph-based | 84.05% | 78.35% |
φ3 | Transition-based | 83.17% | 76.70% |
φ3 | Graph-based | 84.45% | 78.56% |
Luong et al. [18] | Transition-based | 73.03% | 66.35% |
Some results on other Vietnamese dependency treebanks | | | |
Kiem-Hieu [15] | Graph-based | 84.4% | 81.4% |
Dat Quoc et al. [14] | Graph-based (MSTParser) | 79.08% | 71.66% |
Dat Quoc et al. [13] | Graph-based (Neural network) | 80.66% | 73.53% |
4 Conclusion
In this paper, we described in detail our contribution to Vietnamese universal dependencies. We also used this data with the Bist-parser system, which is based on bidirectional LSTMs for dependency parsing. We evaluated the accuracy of the system for Vietnamese parsing in two cases: with and without the distributed word representation features in the Bist-parser system.
The accuracy of our system is UAS=78.17% and LAS=74.84% when we use the GloVe model to produce distributed word representations on the Vietnamese universal dependency data. This is the highest accuracy in comparison with previous research: UAS increases by about 5 percentage points, from 73.21% to 78.17%, and LAS increases from 68.32% to 74.84%. The system also achieves state-of-the-art performance on the Viettreebank [18], with UAS=84.45% and LAS=78.56%.
In the future, we will integrate a CRF into this system. We also plan to apply this model to constituency-based parsing of Vietnamese.