Computación y Sistemas

On-line version ISSN 2007-9737
Print version ISSN 1405-5546

Comp. y Sist. vol.17 n.4 Ciudad de México Oct./Dec. 2013

 

Thesis abstract

 

Generative Manifold Learning for the Exploration of Partially Labeled Data

 


 

Raúl Cruz-Barbosa1, Alfredo Vellido2

 

1 Instituto de Computación, Universidad Tecnológica de la Mixteca, Huajuapan, Oaxaca, México. rcruz@mixteco.utm.mx

2 Departament de Llenguatges i Sistemes Informatics, Universitat Politecnica de Catalunya, Barcelona, Spain. avellido@lsi.upc.edu

 

Article received on 23/12/2011; accepted on 18/06/2013.

 

Abstract

In many real-world applications, the availability of data labels for supervised learning is rather limited, and incompletely labeled datasets are commonplace in some of the currently most active areas of research. A manifold learning model, namely Generative Topographic Mapping (GTM), is the basis of the methods developed in the thesis reported in this paper. A variant of GTM that uses a graph approximation to the geodesic metric is first defined. This model is capable of representing data of convoluted geometries. The standard GTM is here modified to prioritize neighbourhood relationships along the generated manifold. This is accomplished by penalizing the possible divergences between the Euclidean distances from the data points to the model prototypes and the corresponding geodesic distances along the manifold. The resulting Geodesic GTM (Geo-GTM) model is shown to improve the continuity and trustworthiness of the representation generated by the model, as well as to behave robustly in the presence of noise. We then proceed to define a novel semi-supervised model, SS-Geo-GTM, which extends Geo-GTM to deal with semi-supervised problems. In SS-Geo-GTM, the model prototypes obtained from Geo-GTM are linked to the data manifold through their nearest neighbours. The resulting proximity graph is used as the basis for a class label propagation algorithm. The performance of SS-Geo-GTM is experimentally assessed in terms of accuracy and the Matthews correlation coefficient, comparing favorably with a Euclidean distance-based counterpart and with the alternative Laplacian Eigenmaps and semi-supervised Gaussian mixture models.
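The graph approximation to the geodesic metric mentioned above is conventionally built by linking each data point to its nearest Euclidean neighbours and then measuring shortest paths through the resulting graph. The following is a minimal, self-contained sketch of that idea only; it is not the thesis implementation, and the function names and toy dataset are invented for illustration:

```python
import heapq
import math

def knn_graph(points, k):
    """Adjacency list linking each point to its k Euclidean nearest neighbours."""
    n = len(points)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        for d, j in nearest:
            adj[i].append((j, d))
            adj[j].append((i, d))  # symmetrise so the graph is undirected
    return adj

def geodesic_distances(adj, source):
    """Dijkstra shortest paths: graph distances approximate manifold geodesics."""
    dist = {v: math.inf for v in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# Points sampled along a curved arc: the along-manifold (geodesic) distance
# between the endpoints exceeds their straight-line Euclidean distance.
arc = [(math.cos(t / 10), math.sin(t / 10)) for t in range(0, 32, 2)]
adj = knn_graph(arc, k=2)
geo = geodesic_distances(adj, source=0)
print(geo[len(arc) - 1] > math.dist(arc[0], arc[-1]))  # True
```

On data sampled from a curved manifold, the graph distance tracks along-manifold length rather than straight-line separation, which is what allows a geodesic-aware model such as Geo-GTM to respect convoluted geometries.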

Keywords: Semi-supervised learning, Clustering, Generative Topographic Mapping, Exploratory Data Analysis.
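The class label propagation over a proximity graph described in the abstract can be illustrated by a much-simplified scheme: labeled nodes keep their class, and unlabeled nodes iteratively adopt the majority class among their already-labeled graph neighbours. This is a hedged sketch rather than the SS-Geo-GTM algorithm itself; the graph, seed labels, and function name are invented for illustration:

```python
from collections import Counter

def propagate_labels(adj, seeds, max_iters=100):
    """Spread class labels from seed nodes across a proximity graph.

    adj maps each node to a list of neighbour nodes; seeds maps a few
    nodes to their known class. Unlabeled nodes take the majority class
    among labeled neighbours, sweeping until no node changes.
    """
    labels = dict(seeds)  # seed labels stay fixed
    for _ in range(max_iters):
        changed = False
        for node in adj:
            if node in labels:
                continue  # already assigned on an earlier sweep
            neighbour_classes = [labels[v] for v in adj[node] if v in labels]
            if neighbour_classes:
                labels[node] = Counter(neighbour_classes).most_common(1)[0][0]
                changed = True
        if not changed:
            break
    return labels

# Two disjoint chains 0-1-2 and 3-4-5, each seeded at one end: labels
# flow along the graph edges, never jumping between components.
adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3, 5], 5: [4]}
result = propagate_labels(adj, {0: "A", 5: "B"})
print(result[2], result[3])  # A B
```

Because propagation follows graph edges, label assignments respect the connectivity of the underlying proximity graph, which is the property the abstract exploits when the graph is built over Geo-GTM prototypes.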

 


 


 

Acknowledgements

R. Cruz-Barbosa acknowledges the Mexican Secretariat of Public Education (SEP-PROMEP program) for his PhD grant.

 

