On-line version ISSN 1870-9044
Polibits no.44 México July/Dec. 2011
Inference of Fine-grained Attributes of Bengali Corpus for Stylometry Detection
Tanmoy Chakraborty and Sivaji Bandyopadhyay
Manuscript received November 7, 2010.
Manuscript accepted for publication February 6, 2011.
Stylometry, the science of inferring characteristics of an author from the characteristics of documents written by that author, is a problem with a long history. It belongs to the core tasks of text categorization, which include authorship identification, plagiarism detection, forensic investigation, computer security, and copyright and estate disputes. In this work, we present a strategy for stylometry detection of documents written in Bengali. We adopt a set of fine-grained attribute features together with a set of lexical markers for the analysis of the text, and use three semi-supervised measures for making decisions. A majority voting approach is then used for the final classification. The system is fully automatic and language-independent. Evaluation results for stylometry detection of Bengali authors show reasonably promising accuracy in comparison to the baseline model.
Key words: Stylometry, stylistic markers, cosine similarity, chi-square measure, Euclidean distance.
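As a rough illustration of how three measures and a majority vote can be combined, the following Python sketch compares a document's feature vector with one profile vector per author. It is not the authors' implementation: the function names, the toy feature values, the symmetric form of the chi-square measure, and the tie-breaking rule (fall back to the cosine decision) are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (higher = more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def chi_square_distance(u, v):
    """Symmetric chi-square distance (assumed variant; lower = more similar)."""
    total = 0.0
    for a, b in zip(u, v):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return total


def euclidean_distance(u, v):
    """Straight-line distance between two feature vectors (lower = more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(doc, profiles):
    """Let each measure pick an author, then take the majority decision.

    `profiles` maps a (hypothetical) author name to that author's feature vector.
    Returns the winning author and the three individual votes.
    """
    votes = [
        max(profiles, key=lambda name: cosine_similarity(doc, profiles[name])),
        min(profiles, key=lambda name: chi_square_distance(doc, profiles[name])),
        min(profiles, key=lambda name: euclidean_distance(doc, profiles[name])),
    ]
    winner, count = Counter(votes).most_common(1)[0]
    # Assumed tie-break: if all three measures disagree, keep the cosine decision.
    if count < 2:
        winner = votes[0]
    return winner, votes


# Toy data: relative frequencies of three hypothetical stylistic markers.
profiles = {
    "author_a": [0.30, 0.10, 0.60],
    "author_b": [0.10, 0.50, 0.40],
}
doc = [0.28, 0.12, 0.60]

winner, votes = classify(doc, profiles)
print(winner, votes)
```

In this toy case all three measures agree on `author_a`. With real stylistic features the measures can disagree, which is the situation the voting step is meant to resolve.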