Inference of Fine-grained Attributes of Bengali Corpus for Stylometry Detection

Tanmoy Chakraborty* and Sivaji Bandyopadhyay**

Department of Computer Science and Engineering, Jadavpur University, Kolkata, India (e-mail: *its_tanmoy@yahoo.co.in, **sivaji_cse_ju@yahoo.com).

Manuscript received November 7, 2010.
Manuscript accepted for publication February 6, 2011.

Abstract

Stylometry, the science of inferring characteristics of an author from the characteristics of documents written by that author, is a problem with a long history. It belongs to the core tasks of text categorization, with applications in authorship identification, plagiarism detection, forensic investigation, computer security, and copyright and estate disputes. In this work, we present a strategy for stylometry detection in documents written in Bengali. We combine a set of fine-grained attribute features with a set of lexical markers to analyze the text, and use three semi-supervised measures to make decisions. A majority voting approach then produces the final classification. The system is fully automatic and language-independent. Evaluation results for Bengali authors' stylometry detection show reasonably promising accuracy in comparison to the baseline model.

Key words: Stylometry, stylistic markers, cosine-similarity, chi-square measure, Euclidean distance.
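The decision scheme named in the abstract and key words (cosine similarity, chi-square measure, Euclidean distance, then majority voting) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature vectors, the per-author profile dictionary, the symmetric form of the chi-square measure, and the tie-breaking rule are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (higher = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def chi_square_measure(a, b):
    """Symmetric chi-square distance (assumed form; lower = more similar)."""
    total = 0.0
    for x, y in zip(a, b):
        if x + y > 0:  # skip features absent from both vectors
            total += (x - y) ** 2 / (x + y)
    return total


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors (lower = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(doc_vector, author_profiles):
    """Let each measure pick its closest author, then take a majority vote.

    author_profiles is a hypothetical mapping: author name -> feature vector.
    If all three measures disagree, the first vote (cosine) wins; this
    tie-breaking rule is an assumption of the sketch.
    """
    votes = [
        max(author_profiles,
            key=lambda name: cosine_similarity(doc_vector, author_profiles[name])),
        min(author_profiles,
            key=lambda name: chi_square_measure(doc_vector, author_profiles[name])),
        min(author_profiles,
            key=lambda name: euclidean_distance(doc_vector, author_profiles[name])),
    ]
    winner, _ = Counter(votes).most_common(1)[0]
    return winner, votes


if __name__ == "__main__":
    # Toy relative frequencies of three stylistic markers (invented values).
    profiles = {"A": [0.5, 0.3, 0.2], "B": [0.1, 0.2, 0.7]}
    print(classify([0.45, 0.35, 0.2], profiles))
```

Each measure votes independently, so a single measure that is misled by one dominant feature can be outvoted by the other two.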
