Arabic Dialect Identification based on Probabilistic-Phonetic Modeling

Terbeh, Naim; Maraoui, Mohsen; Zrigui, Mounir

doi:10.13053/cys-22-3-3020

Computación y Sistemas

On-line version ISSN 2007-9737; print version ISSN 1405-5546

Comp. y Sist. vol. 22 no. 3, Ciudad de México, Jul./Sep. 2018

Articles of the Thematic Issue

Arabic Dialect Identification based on Probabilistic-Phonetic Modeling

Naim Terbeh¹

Mohsen Maraoui²

Mounir Zrigui²

¹ LaTICE laboratory, Speech processing team, Tunisia

² Réseau National Universitaire Tunisien, Faculty of sciences, Tunisia

Abstract:

The identification of Arabic dialects is commonly treated as a first pre-processing step for natural language processing of Arabic. This task is useful for automatic translation, information retrieval, opinion mining and sentiment analysis. To this end, we propose a statistical approach based on phonetic modeling to identify the corresponding Arabic dialect for each input acoustic signal. First, for each dialect, we calculate a reference phonetic model. Second, for every input audio signal, we calculate its phonetic model. Third, we compare the latter to all reference Arabic dialect models. Finally, we assign the input acoustic signal to the dialect whose reference phonetic model maximizes the cosine similarity, that is, minimizes the angle between the two models. The obtained results are satisfactory: on 117 audio sequences, we attain a classification rate of 93%. Given these results and the coverage of six Arabic dialects, this study can serve as a reference for future work on dialectal speech processing applications.

Keywords: Arabic dialects; probabilistic-phonetic model; dialect identification; cosine similarity

1 Introduction

Human-machine communication is progressing rapidly, easing access to information and its processing through faster interaction methods such as voice commands. Nonetheless, dialectal variability can hinder the understanding of a voice command. To help interactive systems understand transmitted messages, several studies have addressed Arabic dialect identification.

Arabic dialect identification has become a central task for most applications of Arabic speech processing, such as machine translation, speech recognition or social media analysis. According to Zaidan and Callison-Burch [¹], dialect identification can be seen as language identification applied to a group of closely related languages.

The literature contains recent work that has proposed statistical approaches for Arabic dialect identification. However, current methods are often based on linguistic resources (corpora, lexicons, dictionaries), which do not always exist, especially for Maghrebi Arabic. For this reason, we propose a combination of linguistic and numerical methods to identify the dialect of origin of each input audio signal. The dialects covered are Tunisian, Algerian, Moroccan, Syrian, Palestinian and Egyptian.

This paper is structured as follows. The different dialects of spoken Arabic are briefly described in section 2. In section 3, we review some work from the literature on Arabic dialect identification. The proposed methodology is detailed in section 4. The experiments are described in section 5. Concluding remarks and future work are given in section 6.

2 Dialectal Variability

In their daily speech, speakers of Arabic use various dialects that can be considered variants of Modern Standard Arabic (MSA). Tunisian, Algerian and Moroccan dialects share several phonological traits thanks to their common history. Their lexicon contains several pronunciations inherited from other languages such as Berber, French, Turkish, Italian and Spanish. Likewise, Syrian, Palestinian and Egyptian dialects share many phonological features. In the following sub-sections, we briefly describe these dialects.

2.1 Tunisian Dialect

As in other Maghrebi dialects, Tunisian vocabulary is mostly Arabic, with some Berber words. However, it is morphologically and phonologically different from the MSA. The Tunisian dialect is very agglutinative: speakers often use very few words, and a single word can express a whole sentence. It differs from the MSA especially in its negation form, where the markers are always attached to other words as prefixes or suffixes. Moreover, in the Tunisian dialect, several Arabic words are used with significant changes in their stem formation.

2.2 Algerian Dialect

The Algerian dialect is an informal spoken language, not used in official speech. Its vocabulary is roughly similar throughout Algeria. Nevertheless, in the east of the country, the dialect is closer to the Tunisian one, whereas in the west it is closer to the Moroccan one. Most of the words of the Algerian dialect come from the MSA [²], but there is a significant variation in vocalization in most cases, and omission of some sounds in other cases. Contrary to the MSA, a few sounds such as ظ and ذ are not used in Algerian speech; most of the time they are pronounced as ض and د, respectively. Furthermore, the Algerian dialect uses some non-Arabic sounds like ڤ and پ.

2.3 Moroccan Dialect

The Moroccan dialect or the Moroccan Darija is a member of the Maghrebi Arabic language continuum spoken in Morocco.

It is mutually intelligible to some extent with the Algerian dialect and to a lesser extent with the Tunisian one. It has been significantly influenced by the vocabulary of other languages such as Berber, Latin, French and Spanish.

2.4 Egyptian Dialect

Egyptian Arabic is a North African dialect of the Arabic language, which belongs to the Afro-Asiatic language family. It originated in the Nile Delta in Lower Egypt, around the capital Cairo. Egyptian Arabic evolved from Quranic Arabic, which was brought to Egypt during the seventh-century Muslim conquest that aimed to spread the Islamic faith among the Egyptians. The Egyptian dialect was strongly influenced by the Coptic language, the native language of the Egyptians prior to the Arab conquest [³], and was later significantly influenced by other languages such as French, Italian, Turkish and English.

2.5 Syrian and Palestinian Dialects

The Syrian and Palestinian dialects are part of Levantine Spoken Arabic, which also covers Lebanese and Jordanian dialects. Phonologically, structurally and lexically, Levantine Arabic shares several features with other varieties of Arabic. On the other hand, there are significant differences among Levantine dialects depending on geographical area and the urban/rural division. The Syrian dialect is strongly influenced by the Syriac language, a Semitic language of the Middle East that belongs to the Aramaic language group; it also contains a large vocabulary inherited from Turkish and French. The Palestinian dialect is phonetically slightly different from north Levantine dialects. It has two main varieties, urban and rural, and can also be classified geographically into north and south.

3 State of the Art

The literature presents multiple studies addressing dialect identification. In such work, researchers have tried to develop systems that assign the appropriate dialect to each input acoustic signal. We can mention the following:

- To identify Arabic and Chinese dialects, Zhang et al. [⁵] proposed an approach based on the frequency of common n-grams. The main idea is to assign the input acoustic signal to the class that maximizes the number of n-grams shared with a reference acoustic base. Compared to existing systems, the obtained results showed a strong correlation between dialect salience and the frequency of occurrence of n-grams.
- To identify Jordanian and Egyptian dialects, Al-Ayyoub et al. [⁶] proposed a methodology combining different numerical and linguistic audio techniques. The authors also determined the combination of features and classifiers that generates the best results. On a large corpus of Jordanian and Egyptian dialects, the suggested system showed good performance.
- In [⁷], Guellil and Azouaou proposed an unsupervised approach to identify the Algerian dialect within social media. To do so, the authors used a large Algerian dialectal lexicon. The proposition was based on an improved Levenshtein distance [¹¹, ¹²]. Using a corpus of 100 messages collected through the Facebook API, the authors obtained an identification rate exceeding 60%.
- Based on the naive Bayes algorithm and a transfer system, Hamada and Marzouk [⁴] addressed the identification of the Egyptian dialect in text messages circulating on social networks and the generation of the corresponding MSA representation. Using 3,000 words of the Egyptian dialect, the authors attained an identification rate exceeding 92%.

Although the literature is rich in studies addressing Arabic dialect identification, intra-dialect variability motivates the search for new robust features and new similarity measures. Our contribution is a methodology that calculates, for each input acoustic signal, its similarity to reference dialectal bases.

4 Methodology

The goal of this paper is to assign each input acoustic signal to the appropriate Arabic dialect. The main idea consists in comparing the phonetic model representing the input acoustic signal with the reference models of the different Arabic dialects. Based on this comparison, we assign the input acoustic signal to the class that maximizes the cosine similarity, that is, the class whose reference model forms the smallest angle with the input model. Figure 1 describes the operation of the proposed system:

Fig. 1 General form of proposed methodology

4.1 Acoustic Modeling

Generating the reference phonetic models of the Arabic dialects requires an acoustic model trained on different spoken Arabic varieties and a large base of dialectal speech. The speech base must be recorded by native speakers and cover all dialectal variability. Table 1 summarizes the corpora used to train our acoustic model.

Table 1 Summary of base of speeches used to train the acoustic model

Objective: Training an acoustic model for dialectal Arabic language

Dialect	Speech duration (minutes)
Tunisian	77
Algerian	57
Moroccan	66
Syrian	62
Palestinian	70
Egyptian	56
Total	388

4.2 Forced Alignment

Using the Sphinx_align tool, we perform forced alignment, which generates the phonetic transcription of each speech signal. It is sufficient to give this tool the paths to the necessary data: the speech signals in MFCC format, the corresponding phonetic transcriptions, the pronunciation dictionary and the acoustic model (sub-section 4.1). The obtained phonetic transcription is then decomposed into bi-phonemes, and the probability of occurrence of each bi-phoneme is calculated. The vector of all these probabilities forms the reference phonetic model of the dialect covered by the input speech.
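
The bi-phoneme step can be illustrated with a minimal Python sketch. It assumes, since the paper does not spell this out, that a bi-phoneme is a pair of adjacent phonemes taken with a sliding window and that its probability is its relative frequency in the transcription. The function name `biphoneme_probabilities` and the example phoneme sequence are hypothetical.

```python
from collections import Counter


def biphoneme_probabilities(phonemes):
    """Map each adjacent phoneme pair to its relative frequency.

    Assumption: bi-phonemes are overlapping pairs of consecutive phonemes.
    """
    pairs = list(zip(phonemes, phonemes[1:]))
    if not pairs:
        return {}
    counts = Counter(pairs)
    total = len(pairs)
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical phoneme sequence, for illustration only.
example = ["b", "a", "b", "a", "k"]
model = biphoneme_probabilities(example)
for pair, probability in sorted(model.items()):
    print(pair, probability)
```

In this toy sequence there are four bi-phonemes, two of which are ("b", "a"), so that pair receives half of the probability mass.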

Table 2 gives details about the base of speeches used to generate dialectic phonetic models.

Table 2 Summary of base of speeches used to calculate dialectical phonetic models

Objective: Calculating dialectal phonetic models

Dialect	Speech duration (minutes)
Tunisian	88
Algerian	67
Moroccan	78
Syrian	51
Palestinian	58
Egyptian	71
Total	413

All phonetic models must share the same representation and the same ordering of bi-phonemes. For example, the Algerian dialect includes the phonemes “ڤ” and “پ”, which is not the case for other Arabic dialects, so every model must include the bi-phonemes containing these phonemes, at the same positions as in the Algerian reference model (with a probability of zero where they do not occur).
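
One way to obtain this shared ordering is sketched below in Python. It is an illustration under assumptions, not the authors' implementation: the inventory is taken as the sorted union of the bi-phonemes seen in all models, and missing bi-phonemes get probability zero. The names `shared_inventory` and `to_vector`, the toy models, and the label "G" (a placeholder standing in for a dialect-specific phoneme such as ڤ) are hypothetical.

```python
def shared_inventory(models):
    """Sorted union of all bi-phonemes, giving one fixed order for every model."""
    return sorted({pair for model in models for pair in model})


def to_vector(model, inventory):
    """Probability vector aligned on the inventory; absent bi-phonemes get 0.0."""
    return [model.get(pair, 0.0) for pair in inventory]


# Toy models, for illustration only. "G" is a placeholder phoneme label.
algerian = {("a", "G"): 0.5, ("b", "a"): 0.5}
tunisian = {("b", "a"): 1.0}

inventory = shared_inventory([algerian, tunisian])
algerian_vector = to_vector(algerian, inventory)
tunisian_vector = to_vector(tunisian, inventory)
print(inventory)
print(algerian_vector, tunisian_vector)
```

Both vectors now have the same length and the same bi-phoneme at each index, which is what a component-wise comparison requires.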

Figure 3 illustrates an extract from the phonetic model referenced to the Tunisian dialect.

Fig. 2 Sphinx_align procedure

Fig. 3 An extract from Tunisian dialectical phonetic model

4.3 Phonetic Similarity and Classification

For each new speech sequence S to be assigned to a dialect, we transform the acoustic signal into its phonetic model, following the same procedure as in section 4.2. We then calculate the angles θ_S,i (i = 1, …, 6) that separate this model from the reference model of each dialect studied in this paper (Tn: Tunisian, Al: Algerian, Mr: Moroccan, Sy: Syrian, Pl: Palestinian and Eg: Egyptian). Finally, the input speech is assigned to the dialect that minimizes this angle.

For example, a speech sequence S is associated to an appropriate dialectical class as follows:

- We calculate Ɵ = {θ_S,Tn, θ_S,Al, θ_S,Mr, θ_S,Sy, θ_S,Pl, θ_S,Eg}, the set of angles between the phonetic model of the input speech sequence and the reference models of all the spoken Arabic varieties.
- We take Min(Ɵ), the minimum of the set Ɵ.
- The dialect that attains Min(Ɵ) is assigned to the input speech sequence.
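
The three steps above can be restated compactly. The symbols \hat{d} (the assigned dialect) and D (the set of dialects) are introduced here only for illustration. Because the arccosine is strictly decreasing on [-1, 1], choosing the smallest angle is the same as choosing the largest cosine:

```latex
\hat{d} \;=\; \arg\min_{d \in D} \theta_{S,d}
        \;=\; \arg\max_{d \in D} \cos(\theta_{S,d}),
\qquad
D = \{\mathrm{Tn}, \mathrm{Al}, \mathrm{Mr}, \mathrm{Sy}, \mathrm{Pl}, \mathrm{Eg}\}
```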

The calculation of the set Ɵ is based on the following scalar product formulas:

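$$V_1 \cdot V_2 = \sum_{i=1}^{n} V_1[i] \times V_2[i], \tag{1}$$

$$V_1 \cdot V_2 = \|V_1\| \times \|V_2\| \times \cos(\alpha), \tag{2}$$

where $V_1$ and $V_2$ are two vectors of length $n$ representing two different phonetic models, and $\alpha$ is the angle that separates $V_1$ and $V_2$. Thus, we conclude:

$$\cos(\alpha) = \frac{V_1 \cdot V_2}{\|V_1\| \times \|V_2\|}. \tag{3}$$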
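
As a hedged illustration of formulas (1) to (3) and of the minimum-angle rule, the following Python sketch computes the cosine and the angle between two probability vectors and picks the closest reference model. The function names, the two-dialect `references` dictionary and all vector values are hypothetical and chosen only to make the example runnable.

```python
import math


def cosine(v1, v2):
    """Cosine of the angle between two equal-length vectors, as in formula (3)."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm1 * norm2)


def angle(v1, v2):
    """Angle in radians; the cosine is clamped to guard against rounding."""
    return math.acos(max(-1.0, min(1.0, cosine(v1, v2))))


def classify(input_vector, reference_models):
    """Return the dialect label whose reference model forms the smallest angle."""
    return min(reference_models,
               key=lambda d: angle(input_vector, reference_models[d]))


# Toy reference models over three bi-phonemes, for illustration only.
references = {"Tn": [0.6, 0.3, 0.1], "Eg": [0.1, 0.2, 0.7]}
predicted = classify([0.5, 0.4, 0.1], references)
print(predicted)
```

Selecting with `min` over angles gives the same label as selecting with `max` over cosines, since the angle decreases as the cosine grows.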

5 Tests and Results

5.1 Test Conditions

The test is done under the following conditions:

- We prepare an acoustic model trained on dialectal Arabic speech. For this purpose, we use a voice corpus of 388 minutes of Arabic speech covering Tunisian, Algerian, Moroccan, Syrian, Palestinian and Egyptian dialects (in *.wav format and in mono speaker mode).
- We calculate six phonetic models, one for each Arabic dialect covered by this study. For this purpose, we use an acoustic base containing 413 minutes of Arabic dialectal speech (in *.wav format and in mono speaker mode).
- We test the performance of the suggested method on 117 speech sequences recorded by native speakers, which cover all the Arabic spoken varieties studied in this paper. All recordings are in *.wav format and in mono speaker mode. The distribution of the test base is given in Table 3.

Table 3 Summary of base of speeches used to evaluate proposed methodology

Objective: Testing proposed methodology

Dialect	Test sequences
Tunisian	26
Algerian	21
Moroccan	15
Syrian	17
Palestinian	22
Egyptian	16
Total	117

5.2 Experimental Results

In this section, we measure the similarity between each pair of Arabic dialects through the phonetic models that represent the different Arabic spoken varieties. For this purpose, we use the cosine similarity [⁸, ⁹], a measure of correlation between documents, to quantify the similarity between two phonetic models. The choice of this metric is justified by its established performance in document comparison [¹⁰].

Table 4 gives the results of dialectal speech classification and the inter-dialect confusion rates. Figure 4 details the performance of our proposed approach for each Arabic dialect.

Table 4 Classification results using proposed method

Test base (rows) / Assigned dialect (columns)	Tn	Al	Mr	Sy	Pl	Eg
Tn	92.30%	0%	7.69%	0%	0%	0%
Al	0%	95.23%	4.76%	0%	0%	0%
Mr	0%	6.66%	93.33%	0%	0%	0%
Sy	0%	0%	0%	94.11%	5.88%	0%
Pl	0%	0%	0%	9.09%	90.90%	0%
Eg	0%	0%	0%	0%	6.25%	93.75%
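
The overall classification rate can be cross-checked from the diagonal of Table 4 and the test-set sizes of Table 3. The short Python sketch below is only a consistency check: the per-dialect counts of correct sequences are not reported in the paper and are reconstructed here by multiplying each rate by the number of test sequences and rounding.

```python
# Test-set sizes from Table 3 and diagonal rates (in %) from Table 4.
test_counts = {"Tn": 26, "Al": 21, "Mr": 15, "Sy": 17, "Pl": 22, "Eg": 16}
diagonal_rates = {"Tn": 92.30, "Al": 95.23, "Mr": 93.33,
                  "Sy": 94.11, "Pl": 90.90, "Eg": 93.75}

# Reconstructed (not reported) number of correctly classified sequences.
correct = {d: round(test_counts[d] * diagonal_rates[d] / 100)
           for d in test_counts}

total_correct = sum(correct.values())
total = sum(test_counts.values())
overall_accuracy = 100 * total_correct / total
print(correct, total_correct, total, overall_accuracy)
```

Under this reconstruction, the overall rate is consistent with the 93% classification rate reported in the paper.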

Fig. 4 Classification rate for each Arabic dialect

5.3 Discussion

Table 4 reveals some confusion between Arabic dialects. The highest mutual confusion rates are those between the Algerian and Moroccan dialects and between the Palestinian and Syrian dialects. This confusion is explained by the closeness of the dialects in each pair; e.g., Palestinian and Syrian dialects share significant vocabulary.

Some misclassified speech sequences can be explained by the shortness of the acoustic signal. When the dialectal speech sequence to be classified is short, its phonetic model probably does not cover all possible bi-phonemes, so the similarity with the reference phonetic models is distorted. Pathological speech may also result in misclassification [¹⁵, ¹⁶, ¹⁷].
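
The coverage problem can be made concrete with a small Python sketch. The `coverage` function is a hypothetical diagnostic, not part of the proposed method: it reports the fraction of bi-phonemes in the shared inventory that actually occur in a given phonetic model. The two vectors are invented for illustration.

```python
def coverage(vector):
    """Fraction of inventory bi-phonemes with a non-zero probability."""
    if not vector:
        return 0.0
    return sum(1 for p in vector if p > 0.0) / len(vector)


# Toy phonetic models over a five-entry inventory, for illustration only.
long_utterance = [0.2, 0.1, 0.3, 0.25, 0.15]
short_utterance = [0.5, 0.0, 0.5, 0.0, 0.0]

print(coverage(long_utterance))
print(coverage(short_utterance))
```

A low coverage value would flag inputs whose phonetic model is too sparse for the angle to the reference models to be reliable.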

6 Conclusions and Future Work

To conclude, the comparison between phonetic models is a decisive factor in classifying Arabic dialectal speech. In this paper, we have put forward a probabilistic-phonetic methodology to assign each input acoustic signal to the appropriate Arabic dialect. For this purpose, a corpus of 413 minutes of Arabic dialectal speech has been prepared to calculate the phonetic models of the different spoken Arabic dialects. Another corpus containing 388 minutes of Arabic speech covering six Arabic dialects has been recorded to train the acoustic model.

On 117 speech sequences, our proposed method has shown high performance: we have obtained a classification rate of 93% for Arabic dialects, and we have observed some inter-dialect confusion that confirms the closeness between these dialects.

To the best of our knowledge, this work presents the widest dialectal coverage. We are satisfied with the obtained results, and our suggested approach can serve as a reference for work focusing on the classification of dialectal speech.

As future work, we can extend this study to build a platform that transforms an input acoustic signal from its dialectal form to the corresponding MSA form [¹³, ¹⁴]. Indexing spoken content can also benefit from our methodology [¹⁸].

Acknowledgements

First, we would like to express our deepest regards and thanks to all members of the scientific committee of the Computación y Sistemas journal. We also extend our thanks to our supervisors for their valuable advice and encouragement.

References

1. Zaidan, O. F., & Callison-Burch, C. (2011). The Arabic online commentary dataset: An annotated dataset of informal Arabic with high dialectal content. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT’11), Vol. 2, pp. 37-41.

2. Meftouh, K., Bouchemal, N., & Smaili, K. (2012). A Study of a Non-resourced Language: an Algerian Dialect. Third International Workshop on Spoken Languages Technologies for Under-resourced Languages, pp. 125-132.

3. Nishio, T. (1996). Word order and word order change of wh-questions in Egyptian Arabic: The Coptic substratum reconsidered. Proceedings of the 2nd International Conference of L'Association Internationale pour la Dialectologie Arabe, University of Cambridge, pp. 171-179.

4. Hamada, S., & Marzouk, R. (2018). Developing a Transfer-Based System for Arabic Dialects Translation. Intelligent Natural Language Processing: Trends and Applications, Vol. 740, pp. 121-138.

5. Zhang, Q., Bořil, H., & Hansen, J. H. L. (2013). Supervector pre-processing for PRSVM-based Chinese and Arabic dialect identification. IEEE International Conference on Acoustics, Speech and Signal Processing.

6. Al-Ayyoub, M., Rihani, K., Dalgamoni, I., & Abdulla, A. (2014). Spoken Arabic dialects identification: The case of Egyptian and Jordanian dialects. 5th International Conference on Information and Communication Systems.

7. Guellil, I., & Azouaou, F. (2016). Arabic Dialect Identification with an Unsupervised Learning. IEEE Intl Conference on Computational Science and Engineering (CSE) and IEEE Intl Conference on Embedded and Ubiquitous Computing (EUC) and 15th Intl Symposium on Distributed Computing and Applications for Business Engineering.

8. Nguyen, H. V., & Bai, L. (2010). Cosine similarity metric learning for face verification. Asian Conference on Computer Vision, pp. 709-720.

9. Ye, J. (2011). Cosine similarity measures for intuitionistic fuzzy sets and their applications. Mathematical and Computer Modelling, Vol. 53, No. 1, pp. 91-97.

10. Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. KDD workshop on text mining, Vol. 400, No. 1, pp. 525-526.

11. Ho, T., Oh, S. R., & Kim, H. (2017). A parallel approximate string matching under Levenshtein distance on graphics processing units using warp-shuffle operations. PloS one, Vol. 12, No. 10.

12. Lee, T. (2017). Levenshtein distance-based regularity measurement of circadian rhythm patterns. Journal of Theoretical & Applied Information Technology, Vol. 95, No. 18.

13. Harrat, S., Meftouh, K., & Smaili, K. (2017). Creating Parallel Arabic Dialect Corpus: Pitfalls to Avoid. 18th International Conference on Computational Linguistics and Intelligent Text Processing.

14. Malmasi, S., & Zampieri, M. (2017). Arabic dialect identification using iVectors and ASR transcripts.

15. Terbeh, N., Maraoui, M., & Zrigui, M. (2015). Probabilistic Approach for Detection of Vocal Pathologies in the Arabic Speech. CICLing.

16. Terbeh, N., Trigui, A., Maraoui, M., & Zrigui, M. (2016). Arabic speech analysis to identify factors posing pronunciation disorders and to assist learners with vocal disabilities. IEEE International Conference on Engineering & MIS.

17. Terbeh, N., & Zrigui, M. (2017). Identification of Pronunciation Defects in Spoken Arabic Language. International Conference of the Pacific Association for Computational Linguistics, pp. 355-365.

18. Labidi, M., Terbeh, N., Maraoui, M., & Zrigui, M. (2015). Semantic indexing of continuous speech. Information & Communication Technology and Accessibility.

Received: December 19, 2017; Accepted: March 01, 2018

Corresponding author is Naim Terbeh. naim.terbeh@gmail.com, maraoui.mohsen@gmail.com, mounir.zrigui@fsm.rnu.tn

This is an open-access article distributed under the terms of the Creative Commons Attribution License