SciELO - Scientific Electronic Library Online

 


Computación y Sistemas

On-line version ISSN 2007-9737; print version ISSN 1405-5546

Comp. y Sist. vol. 23 no. 3, Ciudad de México, Jul./Sep. 2019. Epub 09-Aug-2021

https://doi.org/10.13053/cys23-3-3248 

Articles of the Thematic Issue

Identification of POS Tag for Khasi Language based on Hidden Markov Model POS Tagger

Sunita Warjri1  * 

Partha Pakray2 

Saralin Lyngdoh1 

Arnab Kumar Maji1 

1 North-Eastern Hill University, Shillong, Meghalaya, India. sunitawarjri@gmail.com, saralyngdoh@gmail.com, arnab.maji@gmail.com.

2 National Institute of Technology, Silchar, Assam, India. parthapakray@gmail.com.


Abstract

Computational Linguistics (CL) has become essential, as many different technologies are involved in making machines understand human languages. Khasi is a language spoken in Meghalaya, India. Many Indian languages have been studied in different fields of Natural Language Processing (NLP), whereas Khasi lacks substantial research from the NLP perspective. Therefore, in this paper, taking POS tagging as one of the key aspects of NLP, we present a POS tagger for the Khasi language based on the Hidden Markov Model (HMM). At this preliminary stage of building an NLP system for Khasi, we begin with an analysis of the categories and structures of words. To that end, we have designed a specific POS tagset to categorize Khasi words and vocabulary. The HMM-based POS system is then trained on Khasi words that have been tagged manually using the designed tagset. As ambiguity is one of the main challenges in POS tagging for Khasi, we anticipated difficulties in tagging. Nevertheless, on the first sets of experimental data, the HMM tagger achieved an accuracy of 76.70%.

Keywords: Natural language processing (NLP); computational linguistics; part of speech (POS); POS tagger; hidden Markov model (HMM)

1 Introduction

Natural Language Processing (NLP) deals with the interaction and communication between computers and natural human language by combining artificial intelligence and computer science. The most important part of any NLP task is understanding the natural language. NLP applications help machines learn, read, and understand human language by simulating the human ability to understand language and by combining computational linguistics, computer science, and artificial intelligence.

The most basic starting point of NLP for any language is POS tagging. POS tagging identifies the grammatical class of each word in a given sentence. It is used in many fields of NLP, such as semantic disambiguation, phrase identification (chunking), named entity recognition, information extraction, and parsing [2,17]. Creating a POS tagger involves several stages: building a tagset, creating a dictionary, considering contextual rules, and checking the inflections and dependent anomalies of the particular language.

Although substantial work has been carried out in different fields of NLP for Indian languages, Khasi lacks such study from the NLP perspective. Other Indian languages such as Hindi, Bengali, Assamese, Manipuri, Marathi, and Tamil have already been studied in computational linguistics. In this paper we present a POS tagger for the Khasi language. Khasi is an official language of the state of Meghalaya in North East India [16].

The name 'Khasi' denotes both the tribe and the language. Khasi is spoken in the Khasi Hills district of Meghalaya, India, as well as in the Assam-Meghalaya border area and along the India-Bangladesh border. Khasi belongs to the Mon-Khmer family, a branch of the Austro-Asiatic languages of Southeast Asia. Very little research in computational linguistics has been done on the Khasi language.

NLP research on Khasi requires a Khasi corpus. A dataset or corpus is a large collection of texts or words used for the analysis of a natural language, and it is an essential component of any NLP research. Therefore, in this paper we describe a tagger for Khasi based on a supervised, trained HMM.

The paper is organized as follows: Section 2 describes related work on POS tagging; Section 3 describes the methodology of the HMM-based POS tagger; Section 4 describes the experimental results; Section 5 presents conclusions and future perspectives.

2 Literature Review

This section presents relevant existing work on POS tagging. POS tagging has been studied for many languages, including Indian languages, English, German, and Spanish, and researchers have proposed different tagging methods and reported the results achieved.

A Hidden Markov Model (HMM) based Part of Speech (POS) tagger for Hindi is discussed in [5]. The Indian Language (IL) POS tagset was employed for the system. In the experiments, the HMM POS tagger achieved an accuracy of 92%.

A morphology-driven POS tagger for Manipuri demonstrated 69% accuracy in [12]. This tagger uses 10,917 unique words and three dictionaries consisting of prefixes, suffixes, and root words, together with information about the text content.

A POS tagger for Malayalam using supervised learning based on the Support Vector Machine (SVM) is discussed in [1]. A tagset of 29 tags was developed for the tagger. With a corpus of 180,000 words, the system achieved 94% accuracy.

POS tagging for Marathi is discussed in [8], using training data of 576 words and a tagset of 9 tags. Tokenization, morphological analysis, and disambiguation were carried out for POS tagging. The authors report an accuracy of 78.82%.

POS taggers for Kokborok based on rules, Conditional Random Fields (CRF), and Support Vector Machines (SVM) are discussed in [9]. A tagged dataset of 42,537 words with 26 tags was used. The taggers attain accuracies of 69% (rule-based), 81.67% (CRF), and 84.46% (SVM).

POS tagging and chunking using Conditional Random Fields (CRF) and Transformation-Based Learning (TBL) are discussed in [10]. The POS tagger achieved accuracies of about 77.37% for Telugu, 78.66% for Hindi, and 76.08% for Bengali, and the chunker achieved accuracies of 79.15% for Telugu, 80.97% for Hindi, and 82.74% for Bengali.

A POS tagger for Bengali is reported in [3]. Tagging based on HMM and Maximum Entropy (ME) stochastic taggers is discussed, and a morphological analyzer is used to improve the performance of the tagger. An accuracy of 76.80% is reported. The authors report good performance for the supervised learning model when suffix information and morphological restrictions on the grammatical categories are used.

POS taggers for Bengali based on SVM and HMM are proposed in [4]. The authors report an accuracy of 79.6% on a manually checked corpus of 0.128 million words, and accuracies of 85.56% for the HMM tagger and 91.23% for the SVM tagger.

POS tagging for Assamese is reported in [11]. The system was designed using the HMM approach, with a tagset of 172 tags and a manually tagged corpus of 10,000 words for training. The HMM POS tagger achieved an accuracy of 87%.

For the Khasi language, the author of [13] introduced a tagset of 61 tags. In [14] the same author introduced a morphological analyzer for Khasi that applies morphotactic rules to a dictionary of 8,000 words. The analyzer uses the base word of each word class together with the subject-verb-object grammatical relationship. The morphotactic rules were derived from the prefixes, infixes, and suffixes of the words.

3 Methodology for HMM POS tagger

In this paper, a POS tagger for the Khasi language is built using the Hidden Markov Model (HMM) as a supervised learning method. The subsections below discuss the methods used in this work to build the supervised POS tagger.

3.1 Tag Sets

A tag is a label that describes a grammatical class (noun, verb, pronoun, preposition, and so on). For example, NN could represent the noun class, JJ the adjective class, and PRP the pronoun class. Each language has its own patterns, frequencies, and speaking style, so the grammatical classes also differ between languages. For this work we have therefore designed a tagset of 54 tags for identifying the grammatical class, or Part-of-Speech (POS), of Khasi words [15], as listed in Table 1 below.

Table 1 POS tagset for Khasi Language [15] 

No. Tag Description
1 PPN Proper nouns
2 CLN Collective nouns
3 CMN Common nouns
4 MTN Material nouns
5 ABN Abstract nouns
6 RFP Reflexive Pronoun
7 EM Emphatic Pronoun
8 RLP Relative Pronouns
9 INP Interrogative Pronouns
10 DMP Demonstrative Pronouns
11 POP Possessive Pronoun
12 CAV Causative Verb
13 TRV Transitive verb
14 ITV Intransitive verb
15 DTV Ditransitive verb
16 ADJ Adjective
17 CMA Comparative Adjective marker
18 SPA Superlative Adjective marker
19 AD Adverb
20 ADT Adverb of Time
21 ADM Adverb of Manner
22 ADP Adverb of Place
23 ADF Adverb of frequency
24 ADD Adverb of degree
25 IN Preposition
26 1PSG 1st Person singular common gender
27 1PPG 1st Person plural common gender
28 2PG 2nd Person singular/plural common gender
29 2PF 2nd Person singular/plural Feminine gender
30 2PM 2nd Person singular/plural Masculine gender
31 3PSF 3rd Person singular Feminine gender
32 3PSM 3rd Person singular Masculine gender
33 3PPG 3rd Person plural common Gender
34 3PSG 3rd Person singular common Gender
35 VPT Verb, present tense
36 VPP Verb, present progressive participle
37 VST Verb, past tense
38 VSP Verb, past perfective participle
39 VFT Verb, future tense
40 Mod Modalities
41 Neg Negation
42 CLF Classifier
43 COC Coordinating conjunction
44 SUC Subordinating conjunction
45 CRC Correlative conjunction
46 CN Cardinal Number
47 ON Ordinal number
48 QNT Quantifiers
49 CO Copula
50 InP Infinitive Participle
51 PaV Passive Voice
52 COM Complementizer
53 FR Foreign words
54 SYM Symbols

More details on the tagset designed for the Khasi language can be found in [15].
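As an illustration of how the tagset in Table 1 might be used during manual annotation, the sketch below stores the 54 tags in a set and reports any annotated token whose tag is not in it. The names `KHASI_TAGSET` and `invalid_tags` are hypothetical and are not part of the original work; the tags themselves are copied from Table 1.

```python
# Tags copied from Table 1. Matching is case-sensitive because INP
# (interrogative pronoun) and InP (infinitive participle) differ only in case.
KHASI_TAGSET = frozenset("""
PPN CLN CMN MTN ABN RFP EM RLP INP DMP POP CAV TRV ITV DTV ADJ CMA SPA
AD ADT ADM ADP ADF ADD IN 1PSG 1PPG 2PG 2PF 2PM 3PSF 3PSM 3PPG 3PSG
VPT VPP VST VSP VFT Mod Neg CLF COC SUC CRC CN ON QNT CO InP PaV COM FR SYM
""".split())


def invalid_tags(tagged_tokens):
    """Return the (word, tag) pairs whose tag is not in the Table 1 tagset."""
    return [(word, tag) for word, tag in tagged_tokens if tag not in KHASI_TAGSET]


# Hypothetical annotation with one mistyped tag ("com" instead of "COM").
print(invalid_tags([("Ka", "3PSF"), ("Kynhun", "CMN"), ("ba", "com")]))
```

A check of this kind only catches typing errors in tag names; it cannot tell whether a valid tag is the linguistically correct one.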

3.2 Data Sets for Corpus Building

In linguistics, a corpus is a large collection of data or written texts gathered for analysis in computational linguistics. For this work, the corpus consists of collected words, each paired with its corresponding tag. To our knowledge, no such Khasi corpus has been available to date.

Therefore, a Khasi corpus has been built in this work. The corpus consists of Khasi text in context, collected from an online Khasi newspaper. Each word of the raw text is then tagged with the appropriate tag from the designed tagset. At this stage, tagging has been done manually. In this way we created a dataset of 7,500 tagged words.

The corpus has been built with great care so that it handles the problems of ambiguity and orthography. In this work, we aim to produce a standard Khasi corpus for POS tagging. The dataset has therefore been built under the observation and validation of a linguistic expert from the Department of Linguistics, North-Eastern Hill University, Shillong, Meghalaya, India. This dataset has then been used for training and testing the tagger.

3.3 POS Identification based on HMM for Khasi Language

This subsection presents the steps of the HMM-based POS tagger used in this work. Part-of-speech tagging is a sequence classification problem. The main objective of the supervised HMM model is to output the most probable tag $y$ for a given word $x$, as shown in equation (1):

$$f(x) = \arg\max_{y \in Y} P(y \mid x). \qquad (1)$$

The HMM POS tagger consists of the following elements and steps:

  1. A finite set of words, $W = \{w_1, w_2, \ldots, w_n\}$.

  2. A finite set of tags, $T = \{t_1, t_2, \ldots, t_n\}$.

  3. $n$ is the length of the sequence.

Our objective is to find the optimal tag sequence $\hat{t}^n$, given by equation (2). Applying Bayes' rule and dropping the denominator $P(w^n)$, which is the same for every candidate tag sequence, gives equation (3):

$$\hat{t}^n = \arg\max_{t^n} P(t^n \mid w^n), \qquad (2)$$

$$\hat{t}^n = \arg\max_{t^n} P(t^n)\, P(w^n \mid t^n). \qquad (3)$$

NOTE: $P(t^n)$ is the prior probability and $P(w^n \mid t^n)$ is the likelihood.

  4. Two simplifying assumptions are made in HMM-based POS taggers, shown in equations (4) and (5):

     i. The bigram assumption, where $P(t_1 \mid t_0)$ is read as $P(t_1)$:

        $$P(t^n) \approx P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_2) \cdots P(t_n \mid t_{n-1}) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}). \qquad (4)$$

     ii. The likelihood assumption:

        $$P(w^n \mid t^n) \approx P(w_1 \mid t_1)\, P(w_2 \mid t_2) \cdots P(w_n \mid t_n) = \prod_{i=1}^{n} P(w_i \mid t_i). \qquad (5)$$

  5. The POS tagger uses equation (6) to estimate the most probable tag sequence:

        $$\hat{t}^n = \arg\max_{t^n} P(t^n \mid w^n) \approx \arg\max_{t^n} \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i). \qquad (6)$$

  6. The tag transition probability $P(t_i \mid t_{i-1})$ is the probability of going from tag $t_{i-1}$ to tag $t_i$, as shown in equation (7):

        $$P(t_i \mid t_{i-1}) = \frac{\mathrm{Count}(t_{i-1}, t_i)}{\mathrm{Count}(t_{i-1})}. \qquad (7)$$

  7. The word emission probability $P(w_i \mid t_i)$ is the probability of emitting word $w_i$ from tag $t_i$, as shown in equation (8):

        $$P(w_i \mid t_i) = \frac{\mathrm{Count}(t_i, w_i)}{\mathrm{Count}(t_i)}. \qquad (8)$$

Further details on the HMM-based supervised POS tagger and the assumptions it makes can be found in [7].
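The estimation and decoding steps of this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the start marker `START` (standing in for the tag before the first word), the probability `floor` used for unseen events, the use of log probabilities, the Viterbi search used to evaluate the arg max, and the split of a few Table 2 tokens into three toy sentences are all choices made here for illustration.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # assumed start-of-sentence marker, plays the role of t_0


def train(sentences):
    """Relative-frequency estimates of transition and emission probabilities."""
    trans = defaultdict(Counter)   # trans[prev][tag] = Count(prev, tag)
    emit = defaultdict(Counter)    # emit[tag][word]  = Count(tag, word)
    tag_count = Counter()          # tag_count[tag]   = Count(tag)
    for sent in sentences:
        prev = START
        tag_count[START] += 1
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            tag_count[tag] += 1
            prev = tag
    p_trans = {p: {t: c / tag_count[p] for t, c in cs.items()} for p, cs in trans.items()}
    p_emit = {t: {w: c / tag_count[t] for w, c in cs.items()} for t, cs in emit.items()}
    return p_trans, p_emit


def viterbi(words, tags, p_trans, p_emit, floor=1e-6):
    """Tag sequence maximizing prod_i P(t_i | t_{i-1}) * P(w_i | t_i)."""
    if not words:
        return []

    def lp(table, a, b):
        # Log probability; unseen events get the assumed floor value.
        return math.log(table.get(a, {}).get(b, floor))

    # best[i][t] = (best log score of a path ending in tag t at position i, previous tag)
    best = [{t: (lp(p_trans, START, t) + lp(p_emit, t, words[0]), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + lp(p_trans, p, t)) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score + lp(p_emit, t, words[i]), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy training data built from Table 2 tokens; the sentence boundaries are assumed.
toy = [
    [("Ka", "3PSF"), ("Kynhun", "CMN"), ("ba", "COM"), ("la", "VST"), ("phah", "TRV")],
    [("u", "3PSM"), ("Myntri", "CMN"), ("Rangbah", "ADJ")],
    [("ka", "3PSF"), ("tnad", "CMN")],
]
p_trans, p_emit = train(toy)
print(viterbi(["u", "Myntri"], sorted(p_emit), p_trans, p_emit))  # ['3PSM', 'CMN']
```

Working in log space avoids numerical underflow when the product in equation (6) runs over long sentences. The fixed floor is the crudest possible treatment of unseen words and tag pairs; a real tagger would likely need a more careful smoothing scheme.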

4 Experimental Results

The subsections below briefly discuss the experimental work conducted on the corpus, the results achieved, and their analysis.

4.1 Corpus

As discussed in the methodology, the corpus has been manually built and checked by a linguistic expert. In this work, a corpus of around 7,500 words has been used for training and 312 words for testing the HMM-based POS tagger. The Khasi corpus for this research has been collected from the online Khasi newspaper [6]; the collected data comprise political news and news articles. A sample of the dataset, manually tagged using the designed tagset, is shown in Table 2. The right-hand column contains the words and the left-hand column contains the corresponding tags.

Table 2 Manually tagged dataset for training 

Tag Khasi Words
3PSF Ka
CMN Kynhun
COM ba
VST la
TRV phah
EM da
3PSM u
CMN Myntri
ADJ Rangbah
3PSF ka
PPN Punjab
3PSM u
FR Capt.
PPN Amanrinder
PPN Singh
IN hapoh
3PSF ka
ABN jingïalam
POP jong
3PSM u
CMN Myntri
3PSF ka
CMN tnad
FR Water
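A tagged sample like Table 2 could be read into memory as sketched below. The on-disk layout is an assumption made for this example (one `TAG word` pair per line, a blank line between sentences), and `read_tagged` is a hypothetical helper name; the paper does not describe the file format of its corpus.

```python
def read_tagged(text):
    """Parse 'TAG word' lines into sentences of (word, tag) pairs.

    Assumed layout: one token per line, tag first, blank line between sentences.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        tag, word = line.split(maxsplit=1)
        current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences


sample = """3PSF Ka
CMN Kynhun
COM ba

3PSM u
CMN Myntri
"""
for sentence in read_tagged(sample):
    print(sentence)
```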

4.2 Discussion on Some Challenges

We encountered several challenges while collecting the data and creating the dataset. The two main challenges discussed in this paper are orthography and ambiguity. In orthography, the major problem is spelling consistency. Khasi spelling has not been fully standardized: different authors spell the same word differently, and many words that are spelt alike have different meanings and are pronounced differently in different contexts. Ambiguity, in turn, arises when words that are spelt and pronounced alike belong to different categories depending on how they are used in a sentence.

Assigning a tag to each word of a given sentence is a difficult task because some words represent more than one grammatical part of speech. This ambiguity is the central challenge of POS tagging.

Some other structural challenges also exist, but they are outside the scope of this paper.

Keeping these problems in mind, we checked and accounted for them carefully while building the dataset. Word categories are assigned on the basis of the root category and prefix information. Some examples of the challenges mentioned above are given below.

For example, the orthography problem is as follows.

Orthography is a very complex problem in the Khasi language, because the same word is written differently in different contexts; for instance, the word ia is sometimes written as ïa. Many such words are spelt and written differently by different people. Some further examples are:

ïadei, ia dei, Khamtam, Kham tam, eiei, ei ei, ïatreilang, ia trei lang, watla, wat la, and so on.
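One possible way to surface such spelling variants for review is to map every form to a comparison key that ignores diacritics, spacing, and letter case. The sketch below is only an illustration: `variant_key` and `group_variants` are hypothetical helpers, and the paper does not state that any automatic normalization was applied. Forms that share a key are candidates for a human annotator to inspect, not confirmed variants.

```python
import unicodedata
from collections import defaultdict


def variant_key(form):
    """Comparison key: strip diacritics and whitespace, then case-fold."""
    decomposed = unicodedata.normalize("NFD", form)
    kept = (ch for ch in decomposed
            if not unicodedata.combining(ch) and not ch.isspace())
    return "".join(kept).casefold()


def group_variants(forms):
    """Group written forms that share a key; keep groups with several forms."""
    groups = defaultdict(list)
    for form in forms:
        groups[variant_key(form)].append(form)
    return {key: members for key, members in groups.items() if len(members) > 1}


print(group_variants(["ïadei", "ia dei", "Khamtam", "Kham tam", "ka"]))
```

Because Khasi also has genuinely distinct words that are spelt alike, a key of this kind can merge forms that should stay separate, so it should not replace the expert validation described in Section 3.2.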

For example, the ambiguity problem is as follows.

Table 3 below shows some ambiguous words of the Khasi language:

Table 3 Some of the Khasi ambiguous words 

Khasi word Meaning POS class
Kot book noun
Kot reach verb
Kam work noun
Kam pace verb
Lum hill noun
Lum collect verb
Bah sir noun
Bah enough adjective
Mar as soon as adverb
Mar material noun
Tam pick verb
Tam over adjective
Kham more adjective
Kham hold verb

We therefore took these problems into account and resolved them according to the needs of our data.
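Ambiguous words of the kind listed in Table 3 can be found mechanically in a tagged corpus by collecting the set of tags seen for each word form. The sketch below assumes a list of (word, tag) pairs and lower-cases the words before comparing them; both the helper name `ambiguous_words` and the tiny input are hypothetical.

```python
from collections import defaultdict


def ambiguous_words(tagged_tokens):
    """Map each word form seen with more than one tag to its sorted tag list."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)  # lower-casing is an assumption
    return {w: sorted(tags) for w, tags in tags_by_word.items() if len(tags) > 1}


# Illustrative tokens only; the tag choices are not taken from the corpus.
tokens = [("Bah", "CMN"), ("bah", "ADJ"), ("Ka", "3PSF"), ("ka", "3PSF")]
print(ambiguous_words(tokens))  # {'bah': ['ADJ', 'CMN']}
```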

4.3 Result

Using this supervised HMM POS tagger for the Khasi language with the corpus of 7,500 words, the system yields an accuracy of 76.70%. Because no Khasi dataset or corpus was available, we had to build our own; the tagging accuracy can likely be improved further by adding more data to the corpus. Table 4 below compares this result with those reported for other languages using the HMM approach.

Table 4 Comparison with other HMM POS taggers 

Sl. no. | Paper title | Language used | Approach | Training corpus | Accuracy
1 | HMM based POS tagger for Hindi | Hindi | Hidden Markov Model | 24 tags; 3,58,288 words | 92%
2 | Development of Marathi Part of Speech Tagger Using Statistical Approach | Marathi | Unigram, Bigram, Trigram and HMM methods | 26 lexical tags; 1,95,647 words | Unigram 77.38%, Bigram 90.30%, Trigram 91.46%, HMM 93.82%
3 | Automatic Part-of-Speech Tagging for Bengali: An Approach for Morphologically Rich Languages in a Poor Resource Scenario | Bengali | Maximum Entropy (ME) and HMM methods | 45,000 words | 76.8%
4 | Web-based Bengali News Corpus for Lexicon Development and POS Tagging | Bengali | HMM and SVM methods | 0.128 million words | SVM 91.23%, HMM 85.56%
5 | Part of Speech Tagger for Assamese Text | Assamese | HMM method | 10,000 words | 87%
6 | Identification of POS Tag for Khasi Language based on Hidden Markov Model POS Tagger (our proposed method) | Khasi | HMM method | 7,500 words | 76.70%

4.4 Result Analysis

Apart from the correctly tagged words, some errors in the experimental results were also analysed. Some words are tagged with a tag that does not belong to them. Because the Khasi words were manually tagged according to their context, some words are tagged wrongly by the system owing to ambiguity.

For example:

bha is tagged as ADJ or AD.

Namar is tagged as COC or SUC.

Hapdeng is tagged as IN or ADP.

In the results, the word ia is tagged 12 times as InP and 2 times as IN; bah is tagged 3 times as CMN and once as ADJ; dang is tagged once as AD and once as VPP. The presence of ambiguous words in the annotated corpus reduces the accuracy. With more annotated data, there is a good chance that the system will improve on this result.
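The accuracy figure and the per-word tag counts discussed above can be computed as sketched below. The function names are hypothetical, and the example input is constructed only to reproduce the counts quoted for ia (12 times InP, 2 times IN); it is not the actual system output.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def tag_tally(words, predicted):
    """Count how often each word received each tag from the tagger."""
    return Counter(zip(words, predicted))


# Constructed example matching the counts reported for the word "ia".
words = ["ia"] * 14
predicted = ["InP"] * 12 + ["IN"] * 2
tally = tag_tally(words, predicted)
print(tally[("ia", "InP")], tally[("ia", "IN")])  # 12 2
```

Comparing such a tally against the gold tags for the same tokens shows which ambiguous words account for most of the tagging errors.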

5 Conclusion and Future Works

Very little NLP work has been done on the Khasi language to date, and tagging in the light of the semantic and technical problems cited above has not previously been discussed. This paper addresses these issues from the broader perspective of NLP. It does not solve all the problems encountered in the POS tagging of Khasi, so the remaining problems are left for future research.

We aim to improve the results of the system by adding more data to the corpus for both training and testing. We also aim to develop syntactic rules for the Khasi language and employ them in the POS tagger. This should improve the tagger and allow a better evaluation of its performance on the Khasi language.

Acknowledgement

The authors would like to acknowledge the Centre for Natural Language Processing, National Institute of Technology Silchar, 788010, Assam, India, for this work.

References

1. Antony, P., Mohan, S. P., & Soman, K. (2010). SVM based part of speech tagger for Malayalam. Recent Trends in Information, Telecommunication and Computing (ITC), 2010 International Conference on, IEEE, pp. 339-341.

2. Chowdhury, G. G. (2003). Natural language processing. Annual Review of Information Science and Technology, Vol. 37, No. 1, pp. 51-89.

3. Dandapat, S., Sarkar, S., & Basu, A. (2007). Automatic part-of-speech tagging for Bengali: An approach for morphologically rich languages in a poor resource scenario. Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, Association for Computational Linguistics, pp. 221-224.

4. Ekbal, A. & Bandyopadhyay, S. (2008). Web-based Bengali news corpus for lexicon development and POS tagging. Polibits, Vol. 37, pp. 21-30.

5. Joshi, N., Darbari, H., & Mathur, I. (2013). HMM based POS tagger for Hindi. Proceeding of 2013 International Conference on Artificial Intelligence, Soft Computing (AISC-2013), pp. 341-349.

6. Mawphor (2017). Mawphor. [Online; accessed 07-Nov-2017].

7. Pakray, P., Majumder, G., & Pathak, A. (2018). An HMM based POS tagger for POS tagging of code-mixed Indian social media text. Annual Convention of the Computer Society of India, Springer, pp. 495-504.

8. Patil, H., Patil, A., & Pawar, B. (2014). Part-of-Speech tagger for Marathi language using limited training corpora. IJCA Proceedings on National Conference on Recent Advances in Information Technology NCRAIT (4), Citeseer, pp. 33-37.

9. Patra, B. G., Debbarma, K., Das, D., & Bandyopadhyay, S. (2012). Part of Speech (POS) tagger for Kokborok. Proceedings of COLING 2012: Posters, pp. 923-932.

10. PVS, A. & Karthik, G. (2007). Part-of-speech tagging and chunking using conditional random fields and transformation based learning. Shallow Parsing for South Asian Languages, Vol. 21, pp. 21-24.

11. Saharia, N., Das, D., Sharma, U., & Kalita, J. (2009). Part of speech tagger for Assamese text. Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, Association for Computational Linguistics, pp. 33-36.

12. Singh, T. D. & Bandyopadhyay, S. (2008). Morphology driven Manipuri POS tagger. Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, pp. 91-97.

13. Tham, M. J. (2012). Design considerations for developing a parts-of-speech tagset for Khasi. Emerging Trends and Applications in Computer Science (NCETACS), 2012 3rd National Conference on, IEEE, pp. 277-280.

14. Tham, M. J. (2013). Preliminary investigation of a morphological analyzer and generator for Khasi. Emerging Trends and Applications in Computer Science (ICETACS), 2013 1st International Conference on, IEEE, pp. 256-259.

15. Warjri, S., Pakray, P., Lyngdoh, S., & Kumar Maji, A. (2018). Khasi language as dominant Part-of-Speech (POS) ascendant in NLP. International Journal of Computational Intelligence & IoT, Vol. 1, No. 1, pp. 109-115.

16. Wikipedia contributors (2018). Khasi language - Wikipedia, the free encyclopedia. [Online; accessed 02-Feb-2018].

17. Wilks, Y. & Stevenson, M. (1998). The grammar of sense: Using part-of-speech tags as a first step in semantic disambiguation. Natural Language Engineering, Vol. 4, No. 2, pp. 135-143.

Received: January 30, 2019; Accepted: March 20, 2019

* Corresponding author is Sunita Warjri. sunitawarjri@gmail.com

This is an open-access article distributed under the terms of the Creative Commons Attribution License.