Aggregation of Similarity Measures for Ortholog Detection: Validation with Measures Based on Rough Set Theory

Millo Sánchez, Reinier; Galpert Cañizares, Deborah; Casa Cardoso, Gladys; Grau Ábalo, Ricardo; Arco García, Leticia; García Lorenzo, María Matilde; Fernandez Marin, Miguel Ángel

doi:10.13053/CyS-18-1-2014-016


Computación y Sistemas

On-line version ISSN 2007-9737; print version ISSN 1405-5546

Comp. y Sist. vol. 18 no. 1, Ciudad de México, Jan./Mar. 2014

https://doi.org/10.13053/CyS-18-1-2014-016

Articles

Agregación de medidas de similitud para la detección de ortólogos: validación con medidas basadas en la teoría de conjuntos aproximados

Aggregation of Similarity Measures for Ortholog Detection: Validation with Measures Based on Rough Set Theory

Reinier Millo Sánchez¹, Deborah Galpert Cañizares¹, Gladys Casa Cardoso¹, Ricardo Grau Ábalo¹, Leticia Arco García¹, María Matilde García Lorenzo¹, and Miguel Ángel Fernandez Marin²

¹ Universidad Central "Marta Abreu" de Las Villas, Santa Clara, Cuba. rmillo@uclv.cu

² Universidad de las Ciencias Informáticas, La Habana, Cuba.

Abstract

This paper presents a novel algorithm for ortholog detection that aggregates similarity measures to characterize the relationship between gene pairs of two genomes. The measures are based on the alignment score, the sequence lengths, membership in conserved regions, and the physicochemical profile of the proteins. The clustering step over the bipartite similarity graph is performed with the Markov clustering algorithm (MCL). A new ortholog assignment policy is then defined over the homology groups obtained from the graph clustering. The classification is validated on the Saccharomyces cerevisiae and Schizosaccharomyces pombe genomes against the ortholog list of the INPARANOID 7.0 algorithm, using the Adjusted Rand Index (ARI) as an external validation measure. Additional validation measures based on rough set theory are applied to assess the quality of the classification while accounting for class imbalance.

Keywords: Similarity measures, ortholog genes, MCL clustering, ortholog assignment, rough set theory, class imbalance.
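The external validation step mentioned in the abstract relies on the Adjusted Rand Index. As a minimal sketch, and not the authors' implementation, the standard pair-counting form of ARI can be computed as follows. The function name `adjusted_rand_index` and the toy labels are assumptions introduced here for illustration; in the setting of the paper, one label vector would plausibly hold the reference assignment (for example, derived from the INPARANOID list) and the other the assignment produced by the algorithm.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items.

    Hypothetical helper for illustration; uses the standard
    pair-counting (contingency table) formulation.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency table cells and its row/column marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs grouped together in both, and in each, labeling.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single group
    return (sum_cells - expected) / (max_index - expected)


# Toy example (made-up labels): 1 = ortholog pair, 0 = non-ortholog pair.
reference = [1, 1, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0]
ari = adjusted_rand_index(reference, predicted)
```

ARI equals 1 for identical partitions (up to relabeling) and is close to 0 for agreement at chance level, which is why it is often preferred over the unadjusted Rand index.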
