Computación y Sistemas

On-line version ISSN 2007-9737; Print version ISSN 1405-5546

Comp. y Sist. vol. 14, no. 1, Ciudad de México, Jul./Sep. 2010

 

Doctoral thesis summary

 

Fast Most Similar Neighbor (MSN) classifiers for Mixed Data

 

Clasificadores Rápidos basados en el algoritmo del Vecino más Similar (MSN) para Datos Mezclados

 

Graduate: Selene Hernández Rodríguez
E-mail: selehdez@ccc.inaoep.mx
National Institute of Astrophysics, Optics and Electronics
Luis Enrique Erro # 1, Santa María Tonantzintla,
C.P. 72840, Puebla, México.

Advisor: José Fco. Martínez Trinidad
E-mail: fmartine@inaoep.mx
National Institute of Astrophysics, Optics and Electronics
Luis Enrique Erro # 1, Santa María Tonantzintla,
C.P. 72840, Puebla, México.

Advisor: Jesús Ariel Carrasco Ochoa
E-mail: ariel@inaoep.mx
National Institute of Astrophysics, Optics and Electronics
Luis Enrique Erro # 1, Santa María Tonantzintla,
C.P. 72840, Puebla, México.

 

Abstract

The k nearest neighbor (k–NN) classifier has been extensively used in Pattern Recognition because of its simplicity and good performance. However, in applications with large datasets, the exhaustive k–NN classifier becomes impractical. Therefore, many fast k–NN classifiers have been developed; most of them rely on metric properties (usually the triangle inequality) to reduce the number of prototype comparisons. Hence, the existing fast k–NN classifiers are applicable only when the comparison function is a metric (commonly for numerical data). However, in some sciences such as Medicine, Geology, and Sociology, the prototypes are usually described by qualitative and quantitative features (mixed data), and in these cases the comparison function does not necessarily satisfy metric properties. For this reason, it is important to develop fast k most similar neighbor (k–MSN) classifiers for mixed data, which use non–metric comparison functions. In this thesis, four fast k–MSN classifiers, following the most successful approaches, are proposed. Experiments on different datasets show that the proposed classifiers significantly reduce the number of prototype comparisons.

Keywords: Nearest neighbor rule, fast nearest neighbor search, mixed data, non–metric comparison functions.
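
To make the setting concrete, the following sketch (illustrative only, not taken from the thesis) implements the exhaustive k–MSN rule with a hypothetical mixed-data comparison function: numeric features contribute a range-normalized absolute difference, and qualitative features contribute a 0/1 mismatch. A function built this way need not satisfy the triangle inequality, which is why metric-based fast k–NN methods cannot be applied directly. All names (mixed_dissimilarity, k_msn_classify, numeric_ranges) are hypothetical.

import heapq

def mixed_dissimilarity(x, y, numeric_ranges):
    # Hypothetical comparison function for mixed data: numeric features
    # contribute a range-normalized absolute difference; qualitative
    # features contribute 0/1 on mismatch. Nothing guarantees that this
    # function satisfies the triangle inequality.
    total = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if i in numeric_ranges:                  # numeric feature
            total += abs(a - b) / numeric_ranges[i]
        else:                                    # qualitative feature
            total += 0.0 if a == b else 1.0
    return total

def k_msn_classify(query, prototypes, labels, k, numeric_ranges):
    # Exhaustive k-MSN rule: the query is compared against every single
    # prototype; this comparison count is what fast k-MSN classifiers reduce.
    scored = [(mixed_dissimilarity(query, p, numeric_ranges), label)
              for p, label in zip(prototypes, labels)]
    neighbors = heapq.nsmallest(k, scored, key=lambda t: t[0])
    votes = {}
    for _, label in neighbors:                   # majority vote among the k MSN
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

For example, with prototypes described by one numeric feature at index 0 (observed range 10.0) and one qualitative feature, numeric_ranges would be {0: 10.0}.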

 

Resumen

El clasificador k vecinos más cercanos (k–NN) ha sido ampliamente utilizado dentro del Reconocimiento de Patrones debido a su simplicidad y buen funcionamiento. Sin embargo, en aplicaciones en las cuales el conjunto de entrenamiento es muy grande, la comparación exhaustiva que realiza k–NN se vuelve inaplicable. Por esta razón, se han desarrollado diversos clasificadores rápidos k–NN, la mayoría de los cuales se basan en propiedades métricas (en particular la desigualdad triangular) para reducir el número de comparaciones entre prototipos. Por lo cual, los clasificadores rápidos k–NN existentes son aplicables solamente cuando la función de comparación es una métrica (usualmente con datos numéricos). Sin embargo, en algunas ciencias como la Medicina, Geociencias, Sociología, etc., los prototipos generalmente están descritos por atributos numéricos y no numéricos (datos mezclados). En estos casos, la función de comparación no siempre cumple propiedades métricas. Por esta razón, es importante desarrollar clasificadores rápidos basados en la búsqueda de los k vecinos más similares (k–MSN) para datos mezclados que usen funciones de comparación no métricas. En esta tesis, se proponen cuatro clasificadores rápidos k–MSN, siguiendo los enfoques más exitosos. Los experimentos con diferentes bases de datos muestran que los clasificadores propuestos reducen significativamente el número de comparaciones entre prototipos.

Palabras clave: Regla del vecino más cercano, búsqueda rápida del vecino más cercano, datos mezclados, funciones de comparación no métricas.
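
By contrast, the sketch below (also illustrative; the function and table names are hypothetical) shows the approximating-and-eliminating idea behind metric-based fast classifiers such as AESA/LAESA: distances from every prototype to a few pivot prototypes are precomputed, and the triangle inequality gives the lower bound |d(q, v) - d(v, p)| <= d(q, p) for any pivot v, so a prototype whose bound already exceeds the current best distance can be discarded without comparing it to the query. When the comparison function violates the triangle inequality, as is common for mixed data, this bound is no longer valid and the pruning may discard the true most similar neighbor; this is the gap the proposed k–MSN classifiers address.

def nn_with_pivot_pruning(query, prototypes, pivots, d, pivot_table):
    # LAESA-like elimination sketch. pivot_table[j][i] stores
    # d(pivots[j], prototypes[i]), computed offline. The pruning is
    # correct only when d satisfies the triangle inequality.
    q_to_pivots = [d(query, v) for v in pivots]  # a few real comparisons
    best_dist, best_idx = float('inf'), -1
    for i, p in enumerate(prototypes):
        # Triangle inequality lower bound:
        # d(query, p) >= |d(query, v) - d(v, p)| for every pivot v
        bound = max(abs(q_to_pivots[j] - pivot_table[j][i])
                    for j in range(len(pivots)))
        if bound >= best_dist:
            continue                             # eliminated without a comparison
        dist = d(query, p)                       # real comparison
        if dist < best_dist:
            best_dist, best_idx = dist, i
    return best_idx, best_dist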

 


 

References

1. Adler, M., & Heeringa, B. (2008). Search Space Reductions for Nearest–Neighbor Queries. Theory and Applications of Models of Computation. Lecture Notes in Computer Science, 4978, 554–567.

2. Arya, S., Mount, D., Netanyahu, N., Silverman, R., & Wu, A. (1998). An optimal algorithm for approximate nearest neighbor searching in high dimensions. Journal of the ACM, 45(6), 891–923.

3. Athitsos, V., Alon, J., & Sclaroff, S. (2005). Efficient Nearest Neighbor Classification Using a Cascade of Approximate Similarity Measures. IEEE Conference on Computer Vision and Pattern Recognition 2005, Washington, USA, 486–493.

4. Beckmann, N., Kriegel, H., Schneider, R., & Seeger, B. (1990). The R*–Tree: An Efficient and Robust Access Method for Points and Rectangles. ACM SIGMOD Record, 19(2), New Jersey, USA, 322–331.

5. Blake, C., & Merz, C. (1998). UCI Repository of machine learning databases [http://archive.ics.uci.edu/ml/datasets.html]. Department of Information and Computer Science, University of California, Irvine, CA. Retrieved January 2006.

6. Chávez, E., & Navarro, G. (2005). A compact space decomposition for effective metric indexing. Pattern Recognition Letters, 26(9), 1363–1376.

7. Cheng, D., Gersho, A., Ramamurthi, B., & Shoham, Y. (1984). Fast search algorithms for vector quantization and pattern matching. IEEE International Conference on Acoustics, Speech and Signal Processing, California, USA, 372–375.

8. Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.

9. Denny, M., & Franklin, M. J. (2006). Operators for Expensive Functions in Continuous Queries. 22nd International Conference on Data Engineering (ICDE'06), Georgia, USA, 147.

10. Dietterich, T. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895–1923.

11. D'haes, W., Dyck, D., & Rodet, X. (2003). PCA–based branch and bound search algorithms for computing k nearest neighbors. Pattern Recognition Letters, 24(9–10), 1437–1451.

12. Figueroa, K., Chávez, E., Navarro, G., & Paredes, R. (2006). On the least cost for proximity searching in metric spaces. Workshop on Experimental Algorithms. Lecture Notes in Computer Science, 4007, 279–290.

13. Fredriksson, K. (2007). Engineering efficient metric indexes. Pattern Recognition Letters, 28(1), 75–84.

14. Friedman, J. H., Bentley, J. L., & Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3), 209–226.

15. Fukunaga, K., & Narendra, P. (1975). A branch and bound algorithm for computing k–nearest neighbors. IEEE Transactions on Computers, C–24(7), 750–753.

16. García–Serrano, J. R., & Martínez–Trinidad, J. F. (1999). Extension to C–Means Algorithm for the use of Similarity Functions. European Conference on Principles of Data Mining and Knowledge Discovery. Lecture Notes in Artificial Intelligence, 1704, 354–359.

17. Goh, K., Li, B., & Chang, E. (2002). DynDex: A Dynamic and Non–metric Space Indexer. Tenth ACM International Conference on Multimedia, Juan–les–Pins, France, 466–475.

18. Gómez–Ballester, E., Micó, L., & Oncina, J. (2006). Some approaches to improve tree–based nearest neighbor search algorithms. Pattern Recognition, 39(2), 171–179.

19. Guttman, A. (1984). R–trees: A Dynamic Index Structure for Spatial Searching. ACM SIGMOD International Conference on Management of Data, New York, USA, 47–57.

20. Hernández–Rodríguez, S., Martínez–Trinidad, J., & Carrasco–Ochoa, A. (2007). Fast k Most Similar Neighbor Classifier for Mixed Data Based on a Tree Structure. Iberoamerican Congress on Pattern Recognition. Lecture Notes in Computer Science, 4756, 407–416.

21. Hernández–Rodríguez, S., Martínez–Trinidad, J., & Carrasco–Ochoa, A. (2007). Fast Most Similar Neighbor Classifier for Mixed Data. 20th Canadian Conference on Artificial Intelligence. Lecture Notes in Artificial Intelligence, 4509, 146–158.

22. Hernández–Rodríguez, S., Martínez–Trinidad, J., & Carrasco–Ochoa, A. (2008). Fast k Most Similar Neighbor Classifier for Mixed Data based on Approximating and Eliminating. Pacific–Asia Conference on Knowledge Discovery and Data Mining. Lecture Notes in Artificial Intelligence, 5012, 697–704.

23. Hernández–Rodríguez, S., Martínez–Trinidad, J., & Carrasco–Ochoa, A. (2008). Fast k Most Similar Neighbor Classifier for Mixed Data based on a Tree Structure and Approximating–Eliminating. 13th Iberoamerican Congress on Pattern Recognition: Progress in Pattern Recognition, Image Analysis and Applications. Lecture Notes in Computer Science, 5197, 364–371.

24. Hernández–Rodríguez, S., Martínez–Trinidad, J., & Carrasco–Ochoa, A. (2008). On the Selection of Base Prototypes for LAESA and TLAESA Classifiers. 19th International Conference on Pattern Recognition, Florida, USA, 407–416.

25. Hwang, W., & Wen, K. (1998). Fast kNN classification algorithm based on partial distance search. Electronics Letters, 34(21), 2062–2063.

26. Kalantari, I., & McDonald, G. (1983). A data structure and an algorithm for the nearest point problem. IEEE Transactions on Software Engineering, 9(5), 631–634.

27. Katayama, N., & Satoh, S. (1997). The SR–tree: An index structure for high–dimensional nearest neighbor queries. ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, USA, 369–380.

28. McNames, J. (2001). A Fast Nearest Neighbor Algorithm Based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9), 964–976.

29. Micó, L., Oncina, J., & Vidal, E. (1994). A new version of the nearest–neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recognition Letters, 15(1), 9–17.

30. Micó, L., Oncina, J., & Carrasco, R. (1996). A fast branch and bound nearest neighbor classifier in metric spaces. Pattern Recognition Letters, 17(7), 731–739.

31. Moreno–Seco, F., Micó, L., & Oncina, J. (2003). Approximate Nearest Neighbor Search with the Fukunaga and Narendra Algorithm and its Application to Chromosome Classification. Iberoamerican Congress on Pattern Recognition. Lecture Notes in Computer Science, 2905, 322–328.

32. Nene, S. A., & Nayar, S. K. (1997). A simple algorithm for nearest neighbour search in high dimensions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(9), 989–1003.

33. Omachi, S., & Aso, H. (2000). A fast algorithm for a k–NN classifier based on the branch and bound method and computational quantity estimation. Systems and Computers in Japan, 31(6), 1–9.

34. Oncina, J., Thollard, F., Gómez–Ballester, E., Micó, L., & Moreno–Seco, F. (2007). A Tabular Pruning Rule in Tree–Based Fast Nearest Neighbor Search Algorithms. Iberian Conference on Pattern Recognition and Image Analysis. Lecture Notes in Computer Science, 4478, 306–313.

35. Panigrahi, R. (2008). An Improved Algorithm Finding Nearest Neighbor Using Kd–trees. 8th Latin American Conference on Theoretical Informatics. Lecture Notes in Computer Science, 4957, 387–398.

36. Ramasubramanian, V., & Paliwal, K. (2000). Fast Nearest–Neighbor Search Algorithms based on Approximation–Elimination search. Pattern Recognition, 33(9), 1497–1510.

37. Tokoro, K., Yamaguchi, K., & Masuda, S. (2006). Improvements of TLAESA Nearest Neighbour Search Algorithm and Extension to Approximation Search. 29th Australasian Computer Science Conference, Hobart, Australia, 48, 77–83.

38. Uhlmann, J. (1991). Metric trees. Applied Mathematics Letters, 4(5), 61–62.

39. Vidal, E. (1986). An algorithm for finding nearest neighbours in (approximately) constant average time complexity. Pattern Recognition Letters, 4(3), 145–157.

40. White, D., & Jain, R. (1996). Similarity indexing with the SS–tree. Twelfth International Conference on Data Engineering (ICDE '96), Washington, USA, 516–523.

41. Wilson, D., & Martínez, T. (2000). Reduction techniques for instance–based learning algorithms. Machine Learning, 38(3), 257–286.

42. Yianilos, P. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. SODA '93: Fourth Annual ACM–SIAM Symposium on Discrete Algorithms, Philadelphia, USA, 311–321.

43. Chen, Y.–S., Hung, Y.–P., & Fuh, C.–S. (2007). Fast and versatile algorithm for nearest neighbor search based on a lower bound tree. Pattern Recognition, 40(2), 360–375.

44. Yunck, T. (1976). A technique to identify nearest neighbors. IEEE Transactions on Systems, Man and Cybernetics, 6(10), 678–683.

45. Zhang, B., & Srihari, S. (2004). Fast k–nearest neighbor classification using cluster–based trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(4), 525–528.
