Inference of Fine-grained Attributes of Bengali Corpus for Stylometry Detection

Chakraborty, Tanmoy; Bandyopadhyay, Sivaji

Polibits

On-line version ISSN 1870-9044

Polibits no. 44, México, Jul./Dec. 2011

Tanmoy Chakraborty* and Sivaji Bandyopadhyay**

Department of Computer Science and Engineering, Jadavpur University, Kolkata, India (e-mail: *its_tanmoy@yahoo.co.in, **sivaji_cse_ju@yahoo.com).

Manuscript received November 7, 2010.
Manuscript accepted for publication February 6, 2011.

Abstract

Stylometry, the science of inferring characteristics of an author from the characteristics of documents written by that author, is a problem with a long history. It belongs to the core tasks of text categorization and has applications in authorship identification, plagiarism detection, forensic investigation, computer security, and copyright and estate disputes. In this work, we present a strategy for stylometry detection of documents written in Bengali. We adopt a set of fine-grained attribute features together with a set of lexical markers for the analysis of the text, and use three semi-supervised measures for making decisions. A majority voting approach is then used for the final classification. The system is fully automatic and language-independent. Evaluation results for stylometry detection of Bengali authors show reasonably promising accuracy in comparison to the baseline model.

Key words: Stylometry, stylistic markers, cosine similarity, chi-square measure, Euclidean distance.
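
The decision step named in the abstract and key words (three measures combined by majority voting) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the feature definitions or the exact formulas, so the feature values, the form of the chi-square measure, the tie-breaking rule, and all function names below are assumptions made for illustration. The sketch also does not reproduce the semi-supervised aspect mentioned in the abstract.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (higher = more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def chi_square(observed, expected):
    """Chi-square style distance (lower = more similar).

    Assumed form: sum of (o - e)^2 / e, skipping features whose expected value is 0.
    """
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def euclidean_distance(u, v):
    """Straight-line distance between two feature vectors (lower = more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(doc, profiles):
    """Let each measure pick an author, then take the majority vote."""
    votes = [
        max(profiles, key=lambda a: cosine_similarity(doc, profiles[a])),
        min(profiles, key=lambda a: chi_square(doc, profiles[a])),
        min(profiles, key=lambda a: euclidean_distance(doc, profiles[a])),
    ]
    winner, count = Counter(votes).most_common(1)[0]
    # Assumption: if all three measures disagree, fall back to the cosine vote.
    return winner if count >= 2 else votes[0]


# Hypothetical author profiles: each vector holds made-up stylistic marker values
# (for example average sentence length, punctuation rate, average word length).
profiles = {
    "author_a": [12.0, 0.30, 4.5],
    "author_b": [20.0, 0.10, 6.0],
}
doc = [13.0, 0.28, 4.7]  # feature vector of an unseen document (made up)

print(classify(doc, profiles))  # author_a
```

With three voters, any author chosen by at least two measures wins outright; a fallback is only needed when there are three or more candidate authors and every measure picks a different one.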

