Identification of POS Tag for Khasi Language based on Hidden Markov Model POS Tagger

Warjri, Sunita; Pakray, Partha; Lyngdoh, Saralin; Kumar Maji, Arnab

doi:10.13053/cys23-3-3248

Computación y Sistemas

On-line version ISSN 2007-9737; print version ISSN 1405-5546

Comp. y Sist. vol. 23 no. 3, Ciudad de México, Jul./Sep. 2019. Epub 09-Aug-2021

https://doi.org/10.13053/cys23-3-3248

Articles of the Thematic Issue

Identification of POS Tag for Khasi Language based on Hidden Markov Model POS Tagger

Sunita Warjri¹*

Partha Pakray²

Saralin Lyngdoh¹

Arnab Kumar Maji¹

¹ North-Eastern Hill University, Shillong, Meghalaya, India. sunitawarjri@gmail.com, saralyngdoh@gmail.com, arnab.maji@gmail.com.

² National Institute of Technology, Silchar, Assam, India. parthapakray@gmail.com.

Abstract

Computational Linguistics (CL) has become essential as more and more technologies aim to make machines understand human languages. Khasi is a language spoken in Meghalaya, India. Many Indian languages have been studied in different fields of Natural Language Processing (NLP), whereas Khasi lacks substantial research from the NLP perspective. Taking POS tagging as one of the key aspects of NLP, this paper therefore presents a POS tagger for the Khasi language based on the Hidden Markov Model (HMM). At this preliminary stage of building an NLP system for Khasi, we begin by analysing the categories and structures of words. To that end, we designed a POS tagset to categorize Khasi words and vocabulary. The HMM-based POS system is then trained on Khasi words that were tagged manually with the designed tagset. Because ambiguity is one of the main challenges in POS tagging for Khasi, we anticipated difficulties in tagging. Nevertheless, on the first sets of experimental data the HMM tagger achieved an accuracy of 76.70%.

Keywords: Natural language processing (NLP); computational linguistics; part of speech (POS); POS tagger; hidden Markov model (HMM)

1 Introduction

Natural Language Processing (NLP) deals with the interaction between computers and natural human language, combining techniques from artificial intelligence and computer science. The central issue in any NLP task is understanding natural language. NLP applications help machines to learn, read and understand human language by simulating the human ability to understand language, drawing on computational linguistics, computer science and artificial intelligence.

The most basic starting point of NLP for any language is POS tagging, which identifies the grammatical class of each word in a given sentence. POS tagging is used in many NLP tasks such as semantic disambiguation, phrase identification (chunking), named entity recognition, information extraction and parsing [2, 17]. Creating a POS tagger involves several stages: building a tagset, creating a dictionary, considering contextual rules, and checking the inflections and dependent anomalies of the particular language.

Although substantial work has been carried out in different fields of NLP for Indian languages, Khasi lacks such study. Other Indian languages such as Hindi, Bengali, Assamese, Manipuri, Marathi and Tamil have already been studied in computational linguistics. In this paper we present a POS tagger for the Khasi language. Khasi is an official language of the state of Meghalaya in North East India [16].

The name 'Khasi' denotes both the tribe and the language. Khasi is spoken in the Khasi Hills district of the state of Meghalaya, India, and also in the Assam-Meghalaya border area and along the India-Bangladesh border. Khasi belongs to the Mon-Khmer family, a branch of the Austro-Asiatic languages of Southeast Asia. Very little research in computational linguistics has addressed Khasi.

NLP research on Khasi requires a Khasi corpus. A dataset or corpus is a large collection of texts or words used for the analysis of a natural language, and it is an essential component of any NLP research. In this paper we therefore describe a tagger for Khasi based on a supervised HMM model trained on such a corpus.

The paper is organized as follows: Section 2 describes related work on POS tagging; Section 3 describes the methodology of the HMM-based POS tagger; Section 4 describes the experimental results; Section 5 gives conclusions and future perspectives.

2 Literature Review

This section presents existing work on POS tagging for different languages. POS tagging has been studied for many languages, including Indian languages, English, German and Spanish, and researchers have proposed different methods and reported the results achieved.

A Hidden Markov Model (HMM) based POS tagger for Hindi is discussed in [5]. The system uses the Indian Language (IL) POS tagset. In the experiments the HMM POS tagger reached an accuracy of 92%.

A morphology-driven POS tagger for Manipuri reached 69% accuracy in [12]. This tagger uses 10,917 unique words and three dictionaries of prefixes, suffixes and root words, together with information about the text content.

A POS tagger for Malayalam using supervised learning with a Support Vector Machine (SVM) is discussed in [1]. A tagset of 29 tags was developed for the tagger. With a corpus of 180,000 words, the system achieved 94% accuracy.

POS tagging for Marathi is discussed in [8], using training data of 576 words and a tagset of 9 tags. Tokenization, morphological analysis and disambiguation were carried out for POS tagging. The authors report an accuracy of 78.82%.

POS taggers for Kokborok based on rules, Conditional Random Fields (CRF) and Support Vector Machines (SVM) are discussed in [9]. A tagged dataset of 42,537 words with 26 tags was used. The taggers reach accuracies of 69% (rule-based), 81.67% (CRF) and 84.46% (SVM).

POS tagging and chunking using Conditional Random Fields (CRF) and Transformation Based Learning (TBL) are discussed in [10]. The POS tagger reached accuracies of about 77.37% for Telugu, 78.66% for Hindi and 76.08% for Bengali, and the chunker reached 79.15% for Telugu, 80.97% for Hindi and 82.74% for Bengali.

A POS tagger for Bengali is reported in [3], based on HMM and Maximum Entropy (ME) stochastic taggers. The study uses a morphological analyzer to improve the tagger's performance and reports an accuracy of 76.80%. The authors report good performance when suffix information and morphological restrictions on the grammatical categories are used in the supervised model.

In [4], POS taggers based on SVM and HMM are proposed for Bengali. The authors report an accuracy of 79.6% on the manually checked corpus of 0.128 million words. The accuracies of the developed POS taggers are given as 85.56% for HMM and 91.23% for SVM.

POS tagging for Assamese is reported in [11]. The system is based on the HMM approach, with a tagset of 172 tags and a manually tagged corpus of 10,000 words used for training. The HMM POS tagger achieved an accuracy of 87%.

For Khasi, the author of [13] introduced a tagset of 61 tags. In [14] the same author introduced a morphological analyzer for Khasi, using morphotactic rules and a dictionary of 8,000 words. The analyzer uses the base words of the word classes together with the subject-verb-object grammatical relationship. The morphotactic rules are derived from the prefixes, infixes and suffixes of words.

3 Methodology for HMM POS tagger

In this paper, a POS tagger for Khasi based on a supervised Hidden Markov Model (HMM) is employed. The subsections below discuss the methods used to build this supervised POS tagger.

3.1 Tag Sets

A tag is a label that describes the grammatical class of a word (noun, verb, pronoun, preposition, and so on). For example, NN could represent the noun class, JJ the adjective class and PRP the pronoun class. Each language has its own patterns, frequencies and speaking style, so the grammatical classes also differ between languages. For this work we therefore designed a tagset of 54 tags for identifying the grammatical class or Part-of-Speech (POS) of Khasi words [15], listed in Table 1 below.

Table 1 POS tagset for Khasi Language [15]

No.	Tag	Description
1	PPN	Proper nouns
2	CLN	Collective nouns
3	CMN	Common nouns
4	MTN	Material nouns
5	ABN	Abstract nouns
6	RFP	Reflexive Pronoun
7	EM	Emphatic Pronoun
8	RLP	Relative Pronouns
9	INP	Interrogative Pronouns
10	DMP	Demonstrative Pronouns
11	POP	Possessive Pronoun
12	CAV	Causative Verb
13	TRV	Transitive verb
14	ITV	Intransitive verb
15	DTV	Ditransitive verb
16	ADJ	Adjective
17	CMA	Comparative Adjective marker
18	SPA	Superlative Adjective marker
19	AD	Adverb
20	ADT	Adverb of Time
21	ADM	Adverb of Manner
22	ADP	Adverb of Place
23	ADF	Adverb of frequency
24	ADD	Adverb of degree
25	IN	Preposition
26	1PSG	1st Person singular common gender
27	1PPG	1st Person plural common gender
28	2PG	2nd Person singular/plural common gender
29	2PF	2nd Person singular/plural Feminine gender
30	2PM	2nd Person singular/plural Masculine gender
31	3PSF	3rd Person singular Feminine gender
32	3PSM	3rd Person singular Masculine gender
33	3PPG	3rd Person plural common Gender
34	3PSG	3rd Person singular common Gender
35	VPT	Verb, present tense
36	VPP	Verb, present progressive participle
37	VST	Verb, past tense
38	VSP	Verb, past perfective participle
39	VFT	Verb, future tense
40	Mod	Modalities
41	Neg	Negation
42	CLF	Classifier
43	COC	Coordinating conjunction
44	SUC	Subordinating conjunction
45	CRC	Correlative conjunction
46	CN	Cardinal Number
47	ON	Ordinal number
48	QNT	Quantifiers
49	CO	Copula
50	InP	Infinitive Participle
51	PaV	Passive Voice
52	COM	Complementizer
53	FR	Foreign words
54	SYM	Symbols

More details on the designed Khasi tagset can be found in [15].
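When a corpus is tagged by hand, a simple consistency check against Table 1 can catch mistyped tags before training. The sketch below is only an illustration and is not part of the authors' system; the names `KHASI_TAGS`, `TAGSET` and `unknown_tags` are hypothetical. It assumes tags are compared case-sensitively, which matters here because Table 1 contains both INP (Interrogative Pronouns) and InP (Infinitive Participle).

```python
# The 54 tags of Table 1, in table order. Tags are case-sensitive:
# INP (interrogative pronoun) and InP (infinitive participle) differ only in case.
KHASI_TAGS = (
    "PPN CLN CMN MTN ABN RFP EM RLP INP DMP POP CAV TRV ITV DTV "
    "ADJ CMA SPA AD ADT ADM ADP ADF ADD IN "
    "1PSG 1PPG 2PG 2PF 2PM 3PSF 3PSM 3PPG 3PSG "
    "VPT VPP VST VSP VFT Mod Neg CLF COC SUC CRC CN ON QNT "
    "CO InP PaV COM FR SYM"
).split()
TAGSET = frozenset(KHASI_TAGS)


def unknown_tags(tagged_words):
    """Return the (word, tag) pairs whose tag does not appear in Table 1."""
    return [(word, tag) for word, tag in tagged_words if tag not in TAGSET]


# "com" is a deliberate typo for COM, used to show what the check reports.
sample = [("Ka", "3PSF"), ("Kynhun", "CMN"), ("ba", "com")]
print(len(TAGSET))           # 54
print(unknown_tags(sample))  # [('ba', 'com')]
```

A check of this kind only verifies that a tag exists in the tagset; whether the tag is the right one for the word in context still needs the linguistic validation described in Section 3.2.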

3.2 Data Sets for Corpus Building

In linguistics, a corpus is a large collection of data or written texts gathered for computational linguistic analysis. In this work, the corpus consists of collected words, each paired with its corresponding tag. We found no such Khasi corpus available to date.

Therefore, a Khasi corpus has been built in this work. It consists of running Khasi text, in context, collected from an online Khasi newspaper. Each word of the raw text is then tagged with the designed tagset. Tagging at this stage was done manually. In this way we created a dataset of 7,500 words.

The corpus has been built painstakingly so that it handles the problems of ambiguity and orthography. Our aim is a standard Khasi corpus for POS tagging. The dataset was therefore built under the observation and validation of a linguistics expert from the Department of Linguistics, North-Eastern Hill University, Shillong, Meghalaya, India. This dataset was then used for training and testing the tagger.

3.3 POS Identification based on HMM for Khasi Language

This subsection presents the steps of the Hidden Markov Model (HMM) based POS tagger used in this work. Part-of-speech tagging is a sequence classification problem. The objective of the supervised HMM model is to output the most probable tag $y$ for a given word $x$, as shown in equation (1):

$$f(x) = \arg\max_{y \in Y} P(y \mid x). \tag{1}$$

The HMM POS tagger consists of the following elements, and the most probable tags are computed in the following steps:

1. A finite sequence of words, $W = w_1, w_2, \ldots, w_n$.
2. A finite sequence of tags, $T = t_1, t_2, \ldots, t_n$.
3. $n$ is the length of the sequence.

Writing $w^n$ and $t^n$ for these word and tag sequences, the optimal tag sequence $\hat{t}^n$ is given by equations (2) and (3):

$$\hat{t}^n = \arg\max_{t^n} P(t^n \mid w^n) \tag{2}$$

$$\hat{t}^n = \arg\max_{t^n} P(t^n)\, P(w^n \mid t^n). \tag{3}$$

NOTE: $P(t^n)$ is the prior probability and $P(w^n \mid t^n)$ is the likelihood.

4. Two assumptions are made in HMM-based POS taggers, shown in equations (4) and (5):

   i. The bigram assumption (each tag depends only on the previous tag; $P(t_1 \mid t_0)$ is read as $P(t_1)$, with $t_0$ marking the start of the sequence):

   $$P(t^n) \approx P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_2)\, P(t_4 \mid t_3) \cdots P(t_n \mid t_{n-1}) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}). \tag{4}$$

   ii. The likelihood assumption (each word depends only on its own tag):

   $$P(w^n \mid t^n) \approx P(w_1 \mid t_1)\, P(w_2 \mid t_2)\, P(w_3 \mid t_3) \cdots P(w_n \mid t_n) = \prod_{i=1}^{n} P(w_i \mid t_i). \tag{5}$$

5. The POS tagger uses equation (6) to estimate the most probable tag sequence:

   $$\hat{t}^n = \arg\max_{t^n} P(t^n \mid w^n) \approx \arg\max_{t^n} \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i). \tag{6}$$

6. The tag transition probability $P(t_i \mid t_{i-1})$ is the probability of going from tag $t_{i-1}$ to tag $t_i$, as shown in equation (7):

   $$P(t_i \mid t_{i-1}) = \frac{\mathrm{Count}(t_{i-1}, t_i)}{\mathrm{Count}(t_{i-1})}. \tag{7}$$

7. The word emission probability $P(w_i \mid t_i)$ is the probability of emitting word $w_i$ given tag $t_i$, as shown in equation (8):

   $$P(w_i \mid t_i) = \frac{\mathrm{Count}(t_i, w_i)}{\mathrm{Count}(t_i)}. \tag{8}$$

More details on the supervised HMM POS tagger and the assumptions considered can be found in [7].
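Equations (6) to (8) can be made concrete with a small sketch. The code below is not the authors' implementation. It estimates the transition and emission probabilities by the relative-frequency counts of equations (7) and (8), then searches for the arg max of equation (6) with Viterbi dynamic programming, which is the usual choice but is an assumption here because the paper does not name its search procedure. The `START` marker, the lowercasing of words and the `floor` probability for unseen events are likewise assumptions made for the example. The training data are the rows of Table 2, treated as a single sequence.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sentence-start marker playing the role of t_0


def train(sentences):
    """Relative-frequency estimates of P(t_i | t_{i-1}) and P(w_i | t_i)."""
    tag_count = Counter()    # Count(t)
    trans_count = Counter()  # Count(t_{i-1}, t_i)
    emit_count = Counter()   # Count(t_i, w_i)
    for sentence in sentences:
        prev = START
        tag_count[START] += 1
        for word, tag in sentence:
            trans_count[(prev, tag)] += 1
            emit_count[(tag, word.lower())] += 1  # lowercasing is an assumption
            tag_count[tag] += 1
            prev = tag
    trans = {(p, t): c / tag_count[p] for (p, t), c in trans_count.items()}
    emit = {(t, w): c / tag_count[t] for (t, w), c in emit_count.items()}
    tags = sorted(t for t in tag_count if t != START)
    return trans, emit, tags


def viterbi(words, trans, emit, tags, floor=1e-6):
    """Tag sequence maximising prod_i P(t_i | t_{i-1}) P(w_i | t_i), in log space."""
    best = {START: (0.0, [])}  # tag -> (log score, best path ending in that tag)
    for word in words:
        w = word.lower()
        step = {}
        for t in tags:
            # floor stands in for unseen events; the paper describes no smoothing
            log_emit = math.log(emit.get((t, w), floor))
            score, path = max(
                (s + math.log(trans.get((p, t), floor)), pth)
                for p, (s, pth) in best.items()
            )
            step[t] = (score + log_emit, path + [t])
        best = step
    return max(best.values())[1]


# The rows of Table 2, written as tag/word and treated as one training sequence.
TABLE2 = (
    "3PSF/Ka CMN/Kynhun COM/ba VST/la TRV/phah EM/da 3PSM/u CMN/Myntri "
    "ADJ/Rangbah 3PSF/ka PPN/Punjab 3PSM/u FR/Capt. PPN/Amanrinder PPN/Singh "
    "IN/hapoh 3PSF/ka ABN/jingïalam POP/jong 3PSM/u CMN/Myntri 3PSF/ka "
    "CMN/tnad FR/Water"
)
sentence = [tuple(reversed(item.split("/"))) for item in TABLE2.split()]
trans, emit, tags = train([sentence])
print(trans[("3PSF", "CMN")])  # Count(3PSF, CMN) / Count(3PSF) = 2 / 4 = 0.5
print(viterbi(["ka", "kynhun", "ba", "la", "phah"], trans, emit, tags))
```

On this toy data the decoder simply recovers the tags it was trained on. With 24 training words almost every test word would be unseen, so the choice of `floor` (or of a proper smoothing method) would dominate the output on real text.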

4 Experimental Results

The subsections below briefly discuss the experimental work conducted on the corpus, the results achieved and their analysis.

4.1 Corpus

As discussed in the methodology, the corpus was built manually and checked by a linguistics expert. A dataset of around 7,500 words was used for training and 312 words for testing the HMM-based POS tagger. The Khasi text was collected from an online Khasi newspaper [6] and comprises political news and news articles. A sample of the dataset, manually tagged with the tagset, is shown in Table 2. The left-hand column gives the tags and the right-hand column the corresponding words.

Table 2 Manually tagged dataset for training

Tag	Khasi Words
3PSF	Ka
CMN	Kynhun
COM	ba
VST	la
TRV	phah
EM	da
3PSM	u
CMN	Myntri
ADJ	Rangbah
3PSF	ka
PPN	Punjab
3PSM	u
FR	Capt.
PPN	Amanrinder
PPN	Singh
IN	hapoh
3PSF	ka
ABN	jingïalam
POP	jong
3PSM	u
CMN	Myntri
3PSF	ka
CMN	tnad
FR	Water

4.2 Discussion on Some Challenges

During data collection and dataset creation we encountered some challenges. The two main ones discussed in this paper are orthography and ambiguity. In orthography, the major problem is spelling consistency: Khasi spelling has not been fully standardized. Different authors spell the same word differently, and many words that are spelt alike have different meanings and are pronounced differently in different contexts. Ambiguity arises for words that are spelt and pronounced alike but belong to different categories depending on their use in a sentence.

Assigning a tag to each word of a given sentence is difficult because some words represent more than one part of speech. This ambiguity is the central challenge of POS tagging.

Some other structural challenges also exist, but they are outside the scope of this paper.

Keeping these problems in mind, we checked and accounted for them carefully while building the dataset. The word categories are assigned on the basis of the root category and prefix information, and the dataset was designed accordingly. Examples of the challenges mentioned above are given below.

For example, the orthography problem is as follows.

Orthography is a very complex problem in Khasi, because the same word is written differently in different contexts; for example, the word ia is written as ïa in some contexts. Many words are spelt and written differently by different people, for example:

ïadei, ia dei, Khamtam, Kham tam, eiei, ei ei, ïatreilang, ia trei lang, watla, wat la, and so on.
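One possible way to surface such spelling variants for review is to group word forms under a crude normalized key. The sketch below is an illustration only and is not described in the paper; `spelling_key` is a hypothetical helper that drops diacritics (so ï becomes i), removes spaces and lowercases. Because Khasi also has words that are spelt alike but differ in meaning, a key like this can merge forms that should stay apart, so it is suited to flagging candidates for a human annotator, not to rewriting the corpus automatically.

```python
import unicodedata
from collections import defaultdict


def spelling_key(form):
    """Crude comparison key: no diacritics, no spaces, lowercase."""
    decomposed = unicodedata.normalize("NFD", form)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(no_marks.split()).lower()


# The variant spellings listed above.
forms = [
    "ïadei", "ia dei", "Khamtam", "Kham tam", "eiei", "ei ei",
    "ïatreilang", "ia trei lang", "watla", "wat la",
]

groups = defaultdict(list)
for form in forms:
    groups[spelling_key(form)].append(form)

for key, variants in groups.items():
    print(key, variants)
```

Each of the five keys produced here collects the two spellings of one word from the list above.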

For example, the ambiguity problem is as follows.

Table 3 below shows some ambiguous Khasi words:

Table 3 Some of the Khasi ambiguous words

Khasi word	Meaning	POS class
Kot	book	noun
Kot	reach	verb
Kam	work	noun
Kam	pace	verb
Lum	hill	noun
Lum	collect	verb
Bah	sir	noun
Bah	enough	adjective
Mar	as soon as	adverb
Mar	material	noun
Tam	pick	verb
Tam	over	adjective
Kham	more	adjective
Kham	hold	verb

We considered and resolved these problems according to the needs of our data.
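Ambiguous words of the kind listed in Table 3 can be found mechanically in any tagged word list by collecting the set of classes seen for each word form. The sketch below is an illustration under stated assumptions, not the authors' procedure: `ambiguous_words` is a hypothetical helper, word forms are lowercased before comparison, and the input uses the coarse classes of Table 3 (plus Punjab, a proper noun in Table 2, as an unambiguous control).

```python
from collections import defaultdict


def ambiguous_words(tagged_words):
    """Map each word form seen with more than one class to its sorted classes."""
    classes_by_word = defaultdict(set)
    for word, pos in tagged_words:
        classes_by_word[word.lower()].add(pos)
    return {w: sorted(c) for w, c in classes_by_word.items() if len(c) > 1}


# Rows of Table 3, plus one unambiguous word for contrast.
rows = [
    ("Kot", "noun"), ("Kot", "verb"), ("Kam", "noun"), ("Kam", "verb"),
    ("Lum", "noun"), ("Lum", "verb"), ("Bah", "noun"), ("Bah", "adjective"),
    ("Mar", "adverb"), ("Mar", "noun"), ("Tam", "verb"), ("Tam", "adjective"),
    ("Kham", "adjective"), ("Kham", "verb"), ("Punjab", "noun"),
]

for word, classes in ambiguous_words(rows).items():
    print(word, classes)
```

Run over a full tagged corpus with the 54-tag tagset, the same function would list every word form that the tagger cannot resolve from the word alone and must disambiguate from context.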

4.3 Result

Using this supervised HMM POS tagger for Khasi with a corpus of 7,500 words, the system yields an accuracy of 76.70%. Because no dataset or corpus was available, we had to build our own; the tagging accuracy can be improved further by adding more data to the corpus. Table 4 below compares this result with HMM-based taggers for other languages.

Table 4 Comparison with other HMM POS taggers

Sl. no.	Paper Title	Language Used	Approach	Training corpus	Accuracy
1	HMM based pos tagger for Hindi	Hindi	Hidden Markov Model	24 tags, 3,58,288 words	92%
2	Development of Marathi Part of Speech Tagger Using Statistical Approach	Marathi	Unigram, Bigram, Trigram and HMM Methods	26 lexical tags, 1,95,647 words	Unigram 77.38%, Bigram 90.30%, Trigram 91.46% and HMM 93.82%
3	Automatic Part-of-Speech Tagging for Bengali: An Approach for Morphologically Rich Languages in a Poor Resource Scenario	Bengali	Maximum Entropy (ME) and HMM Methods	45,000 words	76.8%
4	Web-based Bengali News Corpus for Lexicon Development and POS Tagging	Bengali	HMM and SVM Methods	0.128 million words	SVM 91.23%, HMM 85.56%
5	Part of Speech Tagger for Assamese Text	Assamese	HMM Method	10,000 words	87%
6	Identification of POS Tag for Khasi Language based on Hidden Markov Model POS Tagger (Our proposed method)	Khasi	HMM Method	7,500 words	76.70%
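The paper does not spell out how accuracy is computed. A common reading, assumed here, is token-level accuracy: the share of test words whose predicted tag equals the manually assigned tag. The sketch below shows that measure; the function name and the example data are hypothetical and do not reproduce the paper's test set.

```python
def accuracy(gold_tags, predicted_tags):
    """Share of tokens whose predicted tag equals the gold (manual) tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot score an empty test set")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Illustrative data only: the gold tags follow the first rows of Table 2,
# the predicted tags are invented to show a single tagging error.
gold = ["3PSF", "CMN", "COM", "VST", "TRV"]
pred = ["3PSF", "CMN", "COM", "VST", "ITV"]
print(f"{accuracy(gold, pred):.2%}")  # 80.00%
```

Figures in Table 4 come from corpora and tagsets of very different sizes, so a shared definition of accuracy makes them comparable only in a loose sense.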

4.4 Result Analysis

Apart from the correctly tagged words, some errors in the experimental result were also analysed. Some words are tagged with a tag that does not belong to them. Because the Khasi words were tagged manually according to their context, some words are tagged wrongly by the system owing to ambiguity.

For example, the word:

bha is tagged as ADJ or AD.

Namar is tagged as COC or SUC.

Hapdeng is tagged as IN or ADP.

In the result, ia is tagged 12 times as InP and 2 times as IN, bah is tagged 3 times as CMN and once as ADJ, and dang is tagged once as AD and once as VPP. The presence of ambiguous words in the annotated corpus reduces the accuracy. With more annotated data, there is a high chance that the system's results will improve.
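The counts reported above give a rough sense of how much each word depends on context. As an illustrative sketch (the helper `majority_share` is hypothetical and not part of the paper), the code below computes, for each word, its most frequent tag and the share of occurrences that tag covers. A share near 1 means the word is nearly unambiguous in this data; a share near 0.5 means the emission probabilities alone cannot decide and the transition probabilities must.

```python
from collections import Counter

# Tag counts exactly as reported in the text above.
observed = {
    "ia": Counter({"InP": 12, "IN": 2}),
    "bah": Counter({"CMN": 3, "ADJ": 1}),
    "dang": Counter({"AD": 1, "VPP": 1}),
}


def majority_share(counts):
    """Return the most frequent tag and the fraction of occurrences it covers."""
    tag, n = counts.most_common(1)[0]
    return tag, n / sum(counts.values())


for word, counts in observed.items():
    tag, share = majority_share(counts)
    print(f"{word}: {tag} {share:.2f}")
```

With so few occurrences per word these shares are unstable, which is consistent with the expectation that more annotated data should help.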

5 Conclusion and Future Works

As very little work has been done on Khasi in NLP to date, and tagging has not been discussed at all with respect to the semantic and technical problems cited above, this paper addresses these issues from the larger perspective of NLP. The problems raised here do not cover all the problems encountered in POS tagging of Khasi, so there remains scope for accounting for the others in future research.

We aim to improve the system's results by adding more data to the corpus, for both training and testing. We also aim to develop syntactic rules for Khasi and to employ them in the POS tagger. This will help us achieve and evaluate better performance of the Khasi POS tagger.

Acknowledgement

Authors would like to acknowledge the Centre for Natural Language Processing, National Institute of Technology Silchar, 788010, Assam, India for this work.

References

1. Antony, P., Mohan, S. P., & Soman, K. (2010). SVM based part of speech tagger for Malayalam. Recent Trends in Information, Telecommunication and Computing (ITC), 2010 International Conference on, IEEE, pp. 339-341.

2. Chowdhury, G. G. (2003). Natural language processing. Annual review of information science and technology, Vol. 37, No. 1, pp. 51-89.

3. Dandapat, S., Sarkar, S., & Basu, A. (2007). Automatic part-of-speech tagging for Bengali: An approach for morphologically rich languages in a poor resource scenario. Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, Association for Computational Linguistics, pp. 221-224.

4. Ekbal, A. & Bandyopadhyay, S. (2008). Web-based Bengali news corpus for lexicon development and POS tagging. Polibits, Vol. 37, pp. 21-30.

5. Joshi, N., Darbari, H., & Mathur, I. (2013). HMM based POS tagger for Hindi. Proceeding of 2013 International Conference on Artificial Intelligence, Soft Computing (AISC-2013), pp. 341-349.

6. Mawphor (2017). Mawphor. [Online; accessed 07-Nov-2017].

7. Pakray, P., Majumder, G., & Pathak, A. (2018). An HMM based pos tagger for POS tagging of code-mixed Indian social media text. Annual Convention of the Computer Society of India, Springer, pp. 495-504.

8. Patil, H., Patil, A., & Pawar, B. (2014). Part-of-Speech tagger for Marathi language using limited training corpora. IJCA Proceedings on National Conference on Recent Advances in Information Technology NCRAIT (4), Citeseer, pp. 33-37.

9. Patra, B. G., Debbarma, K., Das, D., & Bandyopadhyay, S. (2012). Part of Speech (POS) tagger for Kokborok. Proceedings of COLING 2012: Posters, pp. 923-932.

10. PVS, A. & Karthik, G. (2007). Part-of-speech tagging and chunking using conditional random fields and transformation based learning. Shallow Parsing for South Asian Languages, Vol. 21, pp. 21-24.

11. Saharia, N., Das, D., Sharma, U., & Kalita, J. (2009). Part of speech tagger for Assamese text. Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, Association for Computational Linguistics, pp. 33-36.

12. Singh, T. D. & Bandyopadhyay, S. (2008). Morphology driven Manipuri POS tagger. Proceedings of the IJCNLP-08 Workshop on NLP for less privileged languages, pp. 91-97.

13. Tham, M. J. (2012). Design considerations for developing a parts-of-speech tagset for Khasi. Emerging Trends and Applications in Computer Science (NCETACS), 2012 3rd National Conference on, IEEE, pp. 277-280.

14. Tham, M. J. (2013). Preliminary investigation of a morphological analyzer and generator for Khasi. Emerging Trends and Applications in Computer Science (ICETACS), 2013 1st International Conference on, IEEE, pp. 256-259.

15. Warjri, S., Pakray, P., Lyngdoh, S., & Kumar Maji, A. (2018). Khasi language as dominant Part-of-Speech (POS) ascendant in nlp. International Journal of Computational Intelligence & IoT, Vol. 1, No. 1, pp. 109-115.

16. Wikipedia contributors (2018). Khasi language - Wikipedia, the free encyclopedia. [Online; accessed 02-Feb-2018].

17. Wilks, Y. & Stevenson, M. (1998). The grammar of sense: Using 6 part-of-speech tags as a first step in semantic disambiguation. Natural Language Engineering, Vol. 4, No. 2, pp. 135-143.

Received: January 30, 2019; Accepted: March 20, 2019

* Corresponding author: Sunita Warjri. sunitawarjri@gmail.com

This is an open-access article distributed under the terms of the Creative Commons Attribution License