1405-5546

S1405-55462014000100011

10.13053/CyS-18-1-2014-024

Cuba

00 03 2014

18 1 137 151

Articles

Introducing Biases in Document Clustering

Yunior Ramírez-Cruz

Center for Pattern Recognition and Data Mining, Content Management Systems Division, DATYS, Santiago de Cuba, Cuba. yunior@cerpamid.co.cu

Abstract

In this paper, we present three criteria for introducing biases into document clustering algorithms when information characterizing the document collection is available. We focus on collections known to result from a document categorization or sample-based document filtering process. Our proposals rely on profiles, i.e., document samples known to have been used to obtain the collection, from which we extract statistics that determine the biases to introduce. An experimental evaluation over a number of collections extracted from the widely used RCV1 corpus confirms the validity of our proposals and identifies several situations in which biased clusterings, under different criteria, outperform their unbiased counterparts.

Keywords. Document clustering, introducing biases.

