Feature Selection for Microarray Gene Expression Data Using Simulated Annealing Guided by the Multivariate Joint Entropy

González-Navarro, Félix Fernando; Belanche-Muñoz, Lluís A.

doi:10.13053/CyS-18-2-2014-032

Computación y Sistemas

On-line version ISSN 2007-9737

Print version ISSN 1405-5546

Comp. y Sist. vol.18 n.2 Ciudad de México Apr./Jun. 2014

Artículos regulares

Feature Selection for Microarray Gene Expression Data Using Simulated Annealing Guided by the Multivariate Joint Entropy

Selección de características para datos de expresión de los genes mediante microarreglos usando recocido simulado guiado por la entropía conjunta multivariada

Félix Fernando González-Navarro¹ and Lluís A. Belanche-Muñoz²

¹ Instituto de Ingeniería, Universidad Autónoma de Baja California, Mexicali, Mexico fernando.gonzalez@uabc.edu.mx

² Dept. de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Barcelona, Spain belanche@lsi.upc.edu

Abstract

Microarray classification poses many challenges for data analysis, since a gene expression data set may consist of only dozens of observations but thousands or even tens of thousands of genes. In this context, feature subset selection techniques can be very useful for reducing the representation space to one that classification techniques can manage. In this work we use the discretized multivariate joint entropy as the basis for a fast evaluation of gene relevance in microarray gene expression data. The proposed algorithm combines a simulated annealing schedule specially designed for feature subset selection with an incrementally computed joint entropy, which reuses previous values to compute the relevance of the current feature subset. This combination proves to be a powerful tool for maximizing gene subset relevance. Our method delivers highly interpretable solutions that are more accurate than those of competing methods. The algorithm is fast and effective, and has no critical parameters. Experimental results on several public-domain microarray data sets show notably high classification performance with small gene subsets, formed mostly by biologically meaningful genes. The technique is general and could be used in other similar scenarios.

Keywords: Feature selection, microarray gene expression data, multivariate joint entropy, simulated annealing.
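The combination described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several things in it are assumptions made for illustration only. Relevance is scored as the mutual information I(X_S; Y) = H(X_S) + H(Y) - H(X_S, Y) between the discretized gene subset and the class. The cooling schedule is a generic geometric one rather than the schedule designed in the paper. The names `entropy`, `extend_cells`, `relevance`, `anneal` and all parameter defaults are hypothetical. The data are assumed to be already discretized and stored as one list per gene.

```python
import math
import random
from collections import Counter


def entropy(symbols):
    """Shannon entropy, in bits, of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def extend_cells(cells, column):
    """Refine each sample's joint cell with one more discretized feature."""
    return [cell + (value,) for cell, value in zip(cells, column)]


def relevance(cells, y):
    """Assumed score: I(X_S; Y) = H(X_S) + H(Y) - H(X_S, Y)."""
    return entropy(cells) + entropy(y) - entropy(list(zip(cells, y)))


def anneal(columns, y, t0=1.0, cooling=0.95, steps=20, t_min=1e-3, seed=0):
    """Generic simulated annealing over feature subsets (illustrative only)."""
    rng = random.Random(seed)
    n = len(y)
    cache = {frozenset(): [()] * n}  # subset -> joint cell of every sample

    def cells_of(subset):
        # Rebuild from scratch only when the subset has not been seen before.
        if subset not in cache:
            cells = [()] * n
            for f in sorted(subset):
                cells = extend_cells(cells, columns[f])
            cache[subset] = cells
        return cache[subset]

    current = frozenset()
    cur_score = relevance(cache[current], y)
    best, best_score = current, cur_score
    t = t0
    while t > t_min:
        for _ in range(steps):
            f = rng.randrange(len(columns))
            if f in current:
                cand = current - {f}
            else:
                cand = current | {f}
                if cand not in cache:
                    # Incremental step: reuse the cells of the current subset.
                    cache[cand] = extend_cells(cache[current], columns[f])
            cand_score = relevance(cells_of(cand), y)
            delta = cand_score - cur_score
            # Metropolis acceptance rule for a maximization problem.
            if delta >= 0 or rng.random() < math.exp(delta / t):
                current, cur_score = cand, cand_score
                tie = math.isclose(cur_score, best_score, abs_tol=1e-12)
                if cur_score > best_score + 1e-12 or (tie and len(current) < len(best)):
                    best, best_score = current, cur_score
        t *= cooling
    return best, best_score
```

Calling `anneal(columns, y)` returns the best subset found, as a `frozenset` of gene indices, together with its score. Adding a gene only refines the joint cells already computed for the current subset, which is the sense in which previous values are reused here. Preferring the smaller subset when scores tie is a design choice of this sketch that reflects the stated goal of small gene subsets.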

Resumen

La clasificación de microarreglos plantea muchos desafíos para el análisis de datos, dado que un conjunto de datos de expresión de genes puede contener docenas de observaciones con miles o incluso decenas de miles de genes. En este contexto, las técnicas de selección de subconjuntos de características pueden ser muy útiles para reducir el espacio de representación a uno manejable mediante técnicas de clasificación. En este trabajo se utiliza la entropía conjunta discretizada multivariada como base para la evaluación rápida de la relevancia de genes en el contexto de expresión génica mediante microarreglos. El algoritmo propuesto desarrolla una técnica de recocido simulado diseñada especialmente para la selección de subconjuntos de características, a través de la entropía conjunta. Esta es calculada incrementalmente, reutilizando los valores anteriores para calcular la relevancia de los subconjuntos de características. Esta combinación resulta ser una herramienta poderosa cuando se aplica a la maximización de la relevancia de un subconjunto de genes. Nuestro método ofrece soluciones altamente interpretables y más precisas que las propuestas por métodos competidores. El algoritmo propuesto es rápido, eficaz y no presenta parámetros críticos. Los resultados de los experimentos con varios conjuntos de datos de microarreglos de dominio público revelan alto rendimiento de clasificación y subconjuntos de pequeño tamaño, formados en su mayoría por genes biológicamente significativos. La técnica es general y podría ser utilizada en otros escenarios similares.

Palabras clave: Selección de características, datos de expresiones de los genes mediante microarreglos, entropía conjunta multivariada, recocido simulado.

Acknowledgements

The authors wish to thank the Spanish CICyT Project No. CGL2004-04702-C02-02, CONACyT and UABC for supporting this research from its beginning.
