1 Introduction
Nowadays, the Internet is overwhelmed by multilingual content, yet classical Information Retrieval (IR) treats documents and sentences in other languages as unwanted noise [1]. Global internet usage statistics show that web access by non-English users has increased tremendously, but not all non-English users are able to express their queries in English. This creates a need for handling multiple languages, which introduces a new area of IR: Cross-Lingual Information Retrieval (CLIR).
CLIR provides access to relevant information in a language different from the query language [10]; it can be viewed as a translation technique followed by monolingual information retrieval. CLIR follows two types of translation techniques, namely query translation and document translation.
Considerable computation time and space are consumed by document translation, so query translation is preferred [11]. Dictionary-based Translation (DT), Corpus-based Translation (CT), and Machine Translation (MT) are the conventional translation techniques [19]. Manual construction of a dictionary is a cumbersome task, and MT internally uses a parallel corpus; therefore, researchers have put their efforts towards the development of effective and efficient MT systems and the corresponding translation resources.
In this paper, an SMT system is trained with different experimental parameters to translate Hindi language queries; the results are evaluated using the BLEU score for MT and Mean Average Precision (MAP) for Hindi-English CLIR. Since SMT is not able to resolve the issue of morphological variants, a Translation Induction Algorithm (TIA) is proposed which incorporates solutions for morphological variants.
The literature survey is presented in Section 2. Section 3 discusses the SMT system. The proposed TIA is discussed in Section 4. Experimental results and discussions are presented in Section 5, and Section 6 concludes the paper.
2 Literature Survey
The CLIR approaches comprise the direct translation approaches, DT, CT, and MT, and the indirect translation approaches, Cross-Lingual Latent Semantic Indexing (CL-LSI), Cross-Lingual Latent Dirichlet Allocation (CL-LDA), and Cross-Lingual Explicit Semantic Analysis (CL-ESA) [14, 21]. A manual dictionary can be used for translation, with a transliteration mining algorithm to handle Out-Of-Vocabulary (OOV) words that are not present in the dictionary [16]. Transliteration generation or mining techniques are used to handle OOV words [3, 13, 17].
The Term Frequency Model (TFM) builds on a set of parallel sentences and cosine similarity [15]. The dual semantic space based translation models CL-LSI and CL-LDA are effective but not efficient [18]. A Statistical Machine Translation (SMT) system is trained on a parallel corpus [5].
Moses, an open-source, language-independent machine translation toolkit, was developed [8], in which the phrasal translation technique enhances the power of MT [4]. Neural networks play a significant role in the field of data mining; a Neural Machine Translation (NMT) system was developed and evaluated for various foreign languages, but not for Hindi [20].
Since it is very tedious to develop and evaluate MT systems for the Hindi-English language pair, a sentence-aligned parallel corpus, HindiEnCorp, was developed and evaluated for an MT system [2]. A more recently developed sentence-aligned Hindi-English parallel corpus from IIT Bombay, a superset of HindiEnCorp, has been used to train both SMT and NMT systems, with the conclusion that SMT performs better than NMT [9].
3 Statistical Machine Translation
SMT employs four components, i.e., word translation, phrasal translation, decoding, and language modeling [7].
3.1 Word Translation
An IBM model is used to generate the word alignment table from the sentence-aligned parallel corpus. The Hindi and English sentences are given as $h = (h_1, h_2, \dots, h_m)$ of length $m$, and $e = (e_1, e_2, \dots, e_n)$ of length $n$.
An alignment function $a : j \to i$ maps an English word $e_j$ to a Hindi language word $h_i$. Under IBM Model 1, the probability of $e$ with alignment $a$ given $h$ is:

$$p(e, a \mid h) = \frac{\epsilon}{(m+1)^n} \prod_{j=1}^{n} t(e_j \mid h_{a(j)}) \qquad (1)$$

where $t(e_j \mid h_{a(j)})$ is the lexical translation probability, $\epsilon$ is a normalization constant, and $i = a(j) \in \{0, 1, \dots, m\}$, with $h_0$ a NULL token to which otherwise unaligned English words are attached.
A source language word may be aligned with different target language words across iterations, so an Expectation Maximization (EM) algorithm is used to resolve this ambiguity. It alternates between an Expectation step, where the probability of each alignment is computed, and a Maximization step, where the model is re-estimated from the data. The EM steps are applied repeatedly until convergence.
Expectation Step: The probability of alignment $p(a \mid e, h)$ is computed as:

$$p(a \mid e, h) = \frac{p(e, a \mid h)}{p(e \mid h)}$$

where $p(e, a \mid h)$ is computed using Equation 1, and $p(e \mid h)$ is calculated by summing over all alignments:

$$p(e \mid h) = \sum_{a} p(e, a \mid h) = \frac{\epsilon}{(m+1)^n} \prod_{j=1}^{n} \sum_{i=0}^{m} t(e_j \mid h_i)$$
Maximization Step: It includes a count-collection step, where expected counts are gathered over all sentence pairs $(e, h)$, $e$ being a translation of $h$:

$$c(e \mid h; e, h) = \sum_{a} p(a \mid e, h) \sum_{j=1}^{n} \delta(e, e_j)\, \delta(h, h_{a(j)})$$

and the translation probabilities are re-estimated from these counts:

$$t(e \mid h) = \frac{\sum_{(e,h)} c(e \mid h; e, h)}{\sum_{e} \sum_{(e,h)} c(e \mid h; e, h)}$$
Different variations of the IBM models and the Hidden Markov Model (HMM) are used for word alignment. The GIZA++ tool implements IBM Model 5 and the HMM alignment model.
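To make the EM procedure concrete, the following is a minimal sketch of training IBM Model 1, the simplest of the IBM models (lexical translation probabilities only, no distortion or fertility). It is an illustration under simplified assumptions, not the GIZA++ implementation, and all names are illustrative.

```python
from collections import defaultdict

def ibm_model1(corpus, iterations=10):
    """EM training of IBM Model 1 lexical translation probabilities t(e|h).

    corpus: list of (english_tokens, hindi_tokens) sentence pairs.
    Each Hindi sentence is implicitly extended with a NULL token.
    """
    t = defaultdict(lambda: 1e-3)  # near-uniform initialization of t(e|h)
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(e|h)
        total = defaultdict(float)  # normalizer per Hindi word
        for e_sent, h_sent in corpus:
            h_sent = ["NULL"] + h_sent
            for e in e_sent:
                # Expectation: distribute one count for e over all h_i,
                # proportionally to the current t(e|h_i).
                z = sum(t[(e, h)] for h in h_sent)
                for h in h_sent:
                    c = t[(e, h)] / z
                    count[(e, h)] += c
                    total[h] += c
        # Maximization: re-estimate t(e|h) from the expected counts.
        for (e, h), c in count.items():
            t[(e, h)] = c / total[h]
    return t

# Usage on a toy fragment (transliterated Hindi for readability):
corpus = [("the house".split(), "ghar".split()),
          ("the book".split(), "kitab".split()),
          ("a book".split(), "ek kitab".split())]
t = ibm_model1(corpus)
```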
3.2 Phrasal Translation
The phrase model is not limited to linguistic phrases (noun phrases, verb phrases, prepositional phrases, etc.). It includes two steps: extraction of phrase pairs and scoring of phrase pairs. The phrase pairs are extracted such that they are consistent with the word alignment: a phrase pair $(\bar{h}, \bar{e})$ is consistent if the words of $\bar{h}$ and $\bar{e}$ are aligned only to each other and not to any word outside the pair.
A translation probability is assigned to each phrase pair by calculating the relative frequency:

$$\phi(\bar{h} \mid \bar{e}) = \frac{count(\bar{e}, \bar{h})}{\sum_{\bar{h}_i} count(\bar{e}, \bar{h}_i)}$$
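As an illustration of the scoring step, here is a minimal sketch that assumes the phrase pairs have already been extracted consistently with the word alignment (the extraction itself is omitted):

```python
from collections import Counter

def score_phrase_pairs(phrase_pairs):
    """phrase_pairs: list of (h_phrase, e_phrase) tuples extracted
    consistently with the word alignment. Returns phi(h|e) estimated
    by relative frequency."""
    pair_count = Counter(phrase_pairs)
    e_count = Counter(e for _, e in phrase_pairs)
    return {(h, e): c / e_count[e] for (h, e), c in pair_count.items()}
```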
3.3 Decoding
The best target language translation $e_{best}$, with the highest translation probability, is identified at the decoding stage:

$$e_{best} = \operatorname*{argmax}_{e} \, p(e \mid h) = \operatorname*{argmax}_{e} \, p(h \mid e)\, p_{LM}(e)$$

where $p(h \mid e)$ is the translation model and $p_{LM}(e)$ is the language model.
4 Proposed Algorithm
A Translation Induction Algorithm (TIA) is proposed in Algorithm 1, which incorporates Refined Stop-Words and Morphological Variants Solutions.
Refined Stop-Words (RSW): Stop-words are frequently occurring words which are considered noise in Mono-Lingual Information Retrieval (MoLiIR). In the CLIR scenario, however, a source or target language stop-word may have multiple meaningful translations in the other language; hence, stop-words play a significant role in CLIR. Examples of such stop-words are presented in Table 1. These meaningful stop-words are eliminated from the standard source and target language stop-word lists; the eliminated words are listed in Table 2.
Morphological Variants Solutions (MVS): The maximum Longest Common Subsequence Ratio (LCSR) score is used to select the approximate nearest word when a source language query word is not present in its exact form in the parallel corpus. However, the LCSR alone is not sufficient for a morphologically rich language, so the following solutions are additionally applied to trace the approximate nearest word. The LCSR score between two strings $a$ and $b$ is computed as:

$$LCSR(a, b) = \frac{|LCS(a, b)|}{\max(|a|, |b|)}$$

where $LCS(a, b)$ returns the longest common subsequence of the strings $a$ and $b$.
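A minimal sketch of the LCSR computation as defined above, using the standard dynamic-programming LCS over characters:

```python
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence of strings a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcsr(a: str, b: str) -> float:
    """LCSR(a, b) = |LCS(a, b)| / max(|a|, |b|)."""
    return lcs_len(a, b) / max(len(a), len(b))
```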
Equality of a nukta character with the corresponding non-nukta character: LCSR is unable to detect the equality between nukta and non-nukta characters. Words with nukta characters include सड़क (sadak), लड़ाई (ladai), and परवेज़ (parvez). The target documents contain many words written both with and without nukta characters, so an equality solution is applied in which the nukta and non-nukta forms of a character are treated as equal.
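One plausible way to realize this equality in code (not necessarily the authors' implementation) is Unicode decomposition: precomposed nukta consonants decompose under NFD into the base consonant plus the combining nukta sign (U+093C), which can then be stripped. A minimal sketch:

```python
import unicodedata

def strip_nukta(text: str) -> str:
    """Map each nukta consonant to its non-nukta base character by
    NFD-decomposing and removing the combining nukta sign (U+093C)."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace("\u093c", ""))

# e.g. strip_nukta("सड़क") == "सडक"  (sadak with and without nukta)
```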
Auto-correction of query words: Query words are searched in the parallel corpus as they appear; their correctness is not verified. A popularity-based correction is therefore applied: the frequency $wf_i$ of a query word is computed over the corpus and compared against an empirically defined threshold $T$. If $wf_i$ is less than $T$, then the frequency $cwf_i$ of the query word's nearest word, found via LCSR, is computed. If $cwf_i > wf_i$, the query word is replaced by its nearest word. Examples of such words are shown in Table 3.
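A minimal sketch of this popularity-based correction rule, reusing the lcsr function from the earlier sketch; the frequency table vocab_freq and the threshold T are assumed inputs:

```python
def autocorrect(word, vocab_freq, T=5):
    """Replace a rare query word with its most LCSR-similar corpus word
    if that word is more frequent; vocab_freq maps word -> corpus frequency."""
    wf = vocab_freq.get(word, 0)
    if wf >= T:
        return word  # the word is popular enough; keep it as-is
    # Nearest corpus word by LCSR (ties broken by frequency).
    nearest = max(vocab_freq, key=lambda w: (lcsr(word, w), vocab_freq[w]))
    return nearest if vocab_freq[nearest] > wf else word
```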
Equality of chandra-bindu with anusvara: A query word containing a chandra-bindu may be equivalent to many other words; for example, the word Ambani written with a chandra-bindu has the same LCSR score of 0.83 with the three words Ambani, Ambaji, and Albani. If the chandra-bindu is treated as equivalent to the anusvara, the word Ambani obtains the maximum LCSR score.
Auto-selection of the nearest query word: An LCSR score is used to select the nearest word when a word is not found in its exact form in the parallel corpus. A word may have multiple nearest words with the same LCSR score, as shown in Table 4. A Compressed Word Format (CWF) algorithm [6] is used for auto-selection among such nearest query words; previously, the CWF algorithm had been used for transliteration mining. Further, a set of parallel sentences is selected for each query word $w_i$ from the parallel corpus in a contextual manner: each selected sentence must contain either all three words of the trigram or both words of the bigram around $w_i$, independent of word order. If the number of selected parallel sentences is less than a threshold $t$, then $z$ unigram-based parallel sentences of minimum length are also included. These context-based selected sentences return the appropriate translation, as the sketch below illustrates.
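A minimal sketch of this contextual selection step (the CWF tie-breaking is omitted); the context window and the thresholds t and z are illustrative assumptions:

```python
def select_sentences(query_tokens, i, parallel_corpus, t=10, z=5):
    """Select parallel sentences for query word w_i = query_tokens[i],
    preferring sentences that also contain its trigram/bigram context
    (word order ignored), with a unigram fallback of minimum length."""
    w = query_tokens[i]
    context = set(query_tokens[max(0, i - 1):i + 2])  # w_i with its neighbours
    hits = [(h, e) for h, e in parallel_corpus if context <= set(h)]
    if len(hits) < t:
        # Fallback: the z shortest sentences containing w_i alone.
        unigram = sorted((p for p in parallel_corpus if w in set(p[0])),
                         key=lambda p: len(p[0]))
        hits += unigram[:z]
    return hits
```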
5 Experimental Results and Discussion
FIRE 2010 and 2011 datasets are used to evaluate the CLIR system, while the WMT news test-set 2014 is used to evaluate the MT system. The datasets and resources used for MT and CLIR are presented in Tables 5 and 6. All three experimental setups of the MT system are tuned and evaluated using a common dev_set and test_set.
Table 5. Datasets and resources used for the MT system.

| Training_set | Language Modeling | Dev_set | Test_set |
|---|---|---|---|
| HindiEnCorp | HindiEnCorp | WMT dev_set (520 sentences) | WMT news test_set 2014 (2,507 sentences), and FIRE 2008, 2010, 2011, and 2012 query sets (50 sentences each) |
| IIT Bombay (1,492,827 sentences) | IIT Bombay | | |
| | WMT News 2015 Corpus (3.3 GB) | | |
Table 6. Dataset characteristics for CLIR.

| Dataset characteristic | FIRE 2010 query | FIRE 2010 document | FIRE 2011 query | FIRE 2011 document | HindiEnCorp parallel corpus |
|---|---|---|---|---|---|
| Number of queries/documents/sentences | 50 | 125,586 | 50 | 392,577 | 273,886 |
| Average length (number of tokens) | 6 | 264 | 3 | 245 | 20 |
An SMT system is evaluated using the BLEU score, which computes the N-gram overlap between the MT output and a reference translation. It computes precision for N-grams of size 1 to 4:

$$precision_i = \frac{\text{number of } i\text{-grams in the output that also occur in the reference}}{\text{number of } i\text{-grams in the output}}$$

The BLEU score is computed over the entire corpus rather than for a single sentence [7]:

$$BLEU = \min\!\left(1, \frac{\text{output length}}{\text{reference length}}\right) \left( \prod_{i=1}^{4} precision_i \right)^{1/4}$$
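For reference, corpus-level BLEU can be computed with the sacrebleu package; a minimal usage sketch (file names are illustrative):

```python
import sacrebleu

hypotheses = open("smt_output.en").read().splitlines()
references = open("reference.en").read().splitlines()

# Corpus-level BLEU over all sentences, reported on a 0-100 scale.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)
```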
A CLIR system is evaluated using Recall and Mean Average Precision (MAP). Recall is the fraction of relevant documents that are retrieved, as shown in Equation 14; MAP for a set of queries is the mean of the average precision scores of the queries, and precision is the fraction of retrieved documents that are relevant to the query.

$$Recall = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|} \qquad (14)$$

The average precision of a query is calculated as in Equation 15:

$$AP = \frac{\sum_{k=1}^{n} p(k) \cdot rel(k)}{|\{\text{relevant documents}\}|} \qquad (15)$$

where $k$ is the rank in the sequence of retrieved documents, $n$ is the number of retrieved documents, $p(k)$ is the precision at rank $k$, and $rel(k)$ equals 1 if the document at rank $k$ is relevant and 0 otherwise.
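A minimal sketch of average precision and MAP as defined above:

```python
def average_precision(ranked_docs, relevant):
    """AP over one ranked list: sum of precision@k at each relevant rank,
    normalized by the total number of relevant documents."""
    hits, ap = 0, 0.0
    for k, doc in enumerate(ranked_docs, 1):
        if doc in relevant:
            hits += 1
            ap += hits / k  # p(k) * rel(k)
    return ap / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_docs, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```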
5.1 Experimental Setup
User queries are translated using an SMT system trained with three different experimental setups, as follows.
— SMT_setup1: HindiEnCorp is used for both training and language modeling.
— SMT_setup2: the Hindi-English parallel corpus developed by IIT Bombay is used for both training and language modeling.
— SMT_setup3: the parallel corpus developed by IIT Bombay is used for training, while the WMT news corpus 2015 is used for language modeling.
These experimental setups are tuned using the common dev_set and test_set shown in Table 5.
The FIRE 2010 and 2011 Hindi language query sets are translated using the different SMT setups and the proposed approach; these translated queries are then used to retrieve the target English language documents. HindiEnCorp is used as the parallel corpus in the proposed approach.
The Terrier open-source search engine is used for indexing and retrieval. In our experiments, Terrier uses Term Frequency-Inverse Document Frequency (TF-IDF) for indexing and cosine similarity for retrieval.
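Purely as an illustration of the TF-IDF/cosine retrieval model (a simplified sketch using scikit-learn, not the Terrier implementation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["...", "..."]           # target English documents
query = "translated query text"      # SMT/TIA-translated query

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# Rank documents by cosine similarity to the query vector.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]
```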
5.2 Results and Discussions
An SMT system is trained in the three ways described above and evaluated using the BLEU score [12]. The trained SMT systems are evaluated on five different test sets; their BLEU scores are presented in Table 7. The news test_set 2014 and the FIRE 2008, 2010, and 2011 test sets are evaluated against the corresponding human-translated text, while the FIRE 2012 test set is evaluated against Google-translated text because human-translated text for FIRE 2012 is not available.
Table 7. BLEU scores of the SMT setups on the five test sets.

| Setups | News test set 2014 | FIRE 2008 | FIRE 2010 | FIRE 2011 | FIRE 2012 |
|---|---|---|---|---|---|
| SMT_setup1 | 7.05 | 10.76 | 4.48 | 8.13 | 17.11 |
| SMT_setup2 | 9.70 | 11.72 | 6.75 | 6.53 | 17.59 |
| SMT_setup3 | 8.95 | 11.45 | 5.13 | 8.77 | 17.75 |
SMT_setup2 performs better than SMT_setup1 on all five test sets. The performance of SMT_setup2 and SMT_setup3 is comparable: SMT_setup2 performs better on the first three test sets, while SMT_setup3 performs better on the last two.
Next, these SMT systems and the proposed TIA are evaluated for CLIR using Recall and MAP, as presented in Table 8.
Table 8. CLIR performance (Recall and MAP) on FIRE 2010 and 2011.

| Setups | FIRE 2010 Recall | FIRE 2010 MAP | FIRE 2011 Recall | FIRE 2011 MAP |
|---|---|---|---|---|
| SMT_setup1 | 0.8575 | 0.2382 | 0.7088 | 0.1885 |
| SMT_setup2 | 0.7718 | 0.2075 | 0.6602 | 0.1608 |
| SMT_setup3 | 0.7978 | 0.1994 | 0.6602 | 0.1767 |
| Proposed Approach | 0.8685 | 0.2818 | 0.7195 | 0.1816 |
From the CLIR perspective, SMT_setup1 performs better than SMT_setup2 and SMT_setup3. SMT_setup1 is trained on HindiEnCorp, which is smaller than the IIT Bombay parallel corpus used in SMT_setup2 and SMT_setup3. Although the IIT Bombay corpus is a superset of HindiEnCorp, it is less well organized and introduces noise into the translations; hence its translation performance is poorer from the CLIR perspective. SMT_setup3 uses the WMT news corpus 2015 for language modeling, so it performs a little better than SMT_setup2.
The proposed approach utilizes the well-organized HindiEnCorp as its parallel corpus. The refined stop-words and Morphological Variants Solutions improve Recall and MAP on both the FIRE 2010 and 2011 datasets. From the CLIR perspective, the proposed approach outperforms the Hindi-English SMT system trained on the best available resources.
6 Conclusion
CLIR retrieves target documents that are in a language different from the query language, with the help of an MT technique. Source language user queries are translated using different SMT setups and the proposed approach. HindiEnCorp is smaller than the parallel corpus developed by IIT Bombay, but it is better organized. SMT_setup1 performs slightly worse from the MT perspective, but from the CLIR perspective it performs better than the other SMT setups.
Stop-words play a significant role in CLIR, yet SMT does not deal with stop-words or morphological variants. The proposed approach deals with both, improving the results and outperforming the SMT systems.