Introducing Biases in Document Clustering

Ramírez-Cruz, Yunior

doi:10.13053/CyS-18-1-2014-024

Computación y Sistemas

On-line version ISSN 2007-9737

Print version ISSN 1405-5546

Comp. y Sist. vol.18 n.1 Ciudad de México Jan./Mar. 2014

https://doi.org/10.13053/CyS-18-1-2014-024

Articles

Yunior Ramírez-Cruz

Center for Pattern Recognition and Data Mining, Content Management Systems Division, DATYS, Santiago de Cuba, Cuba. yunior@cerpamid.co.cu

Abstract

In this paper, we present three criteria for introducing biases into document clustering algorithms when information characterizing the document collection is available. We focus on collections known to result from a document categorization or sample-based document filtering process. Our proposals rely on profiles, i.e., the document samples known to have been used to obtain the collection, from which we extract the statistics that determine the biases to introduce. An experimental evaluation over several collections extracted from the widely used RCV1 corpus confirms the validity of our proposals and identifies a number of situations in which biased clusterings, under different criteria, outperform their unbiased counterparts.

Keywords. Document clustering, introduction of biases.


