Decision Tree based Classifiers for Large Datasets

Franco-Arcega, Anilu; Carrasco-Ochoa, Jesús Ariel; Sánchez-Díaz, Guillermo; Martínez-Trinidad, José Francisco

Computación y Sistemas

On-line version ISSN 2007-9737; Print version ISSN 1405-5546

Comp. y Sist. vol. 17, n. 1, Ciudad de México, Jan./Mar. 2013

Doctoral thesis summary


Anilu Franco-Arcega¹,², Jesús Ariel Carrasco-Ochoa², Guillermo Sánchez-Díaz³, and José Francisco Martínez-Trinidad²

¹ Universidad Autónoma del Estado de Hidalgo, Hidalgo, México.

² Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, México.

³ Universidad Autónoma de San Luis Potosí, San Luis Potosí, México.

afranco@uaeh.edu.mx, anifranco6@inaoep.mx, ariel@inaoep.mx, fmartine@inaoep.mx, guillermo.sanchez@uaslp.mx

Article received on 21/09/2011.
Accepted on 25/09/2011.

Abstract

This paper presents several algorithms for building decision trees from large datasets. These algorithms overcome some restrictions of the most recent state-of-the-art algorithms. Three of them are designed to process datasets described exclusively by numeric attributes, and the fourth processes mixed datasets. The proposed algorithms process all the training instances without storing the whole dataset in main memory. In addition, they are faster than the most recent algorithms for building decision trees from large datasets and reach competitive accuracy rates.

Keywords: Decision trees, supervised classification, large datasets.
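
The abstract does not describe the proposed algorithms in detail. As a generic illustration of the idea of processing every training instance without keeping the dataset in main memory, the following is a minimal sketch, not the authors' method. It assumes a single numeric attribute whose range (`lo`, `hi`) is known in advance, and it keeps only per-class histograms while reading the stream once. The function names `entropy` and `best_split_one_pass`, the fixed-width binning, and the use of information gain are all assumptions made for this example.

```python
import math
from collections import defaultdict
from typing import Hashable, Iterable, Optional, Sequence, Tuple


def entropy(counts: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a class-count vector."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def best_split_one_pass(
    stream: Iterable[Tuple[float, Hashable]],
    lo: float,
    hi: float,
    n_bins: int = 16,
) -> Tuple[Optional[float], float]:
    """Pick a threshold for one numeric attribute from a single pass.

    Hypothetical helper: `stream` yields (value, label) pairs. Only
    per-class bin counts are kept, so memory use depends on `n_bins`
    and the number of classes, not on the number of instances.
    """
    width = (hi - lo) / n_bins
    hist = defaultdict(lambda: [0] * n_bins)  # label -> per-bin counts

    # Single pass: each instance updates one counter and is then discarded.
    for value, label in stream:
        b = min(max(int((value - lo) / width), 0), n_bins - 1)
        hist[label][b] += 1

    labels = list(hist)
    totals = [sum(hist[label]) for label in labels]
    n = sum(totals)
    base = entropy(totals)

    # Evaluate each bin boundary as a candidate threshold.
    left = [0] * len(labels)
    best_gain, best_thr = 0.0, None
    for b in range(n_bins - 1):
        for i, label in enumerate(labels):
            left[i] += hist[label][b]
        right = [t - c for t, c in zip(totals, left)]
        n_left, n_right = sum(left), sum(right)
        if n_left == 0 or n_right == 0:
            continue
        gain = base - (n_left / n) * entropy(left) - (n_right / n) * entropy(right)
        if gain > best_gain:
            best_gain, best_thr = gain, lo + (b + 1) * width
    return best_thr, best_gain
```

Because `stream` can be any iterable, it can be a generator that reads records from disk one at a time. The trade-off in this sketch is that candidate thresholds are restricted to bin boundaries, so the split is only as precise as the chosen `n_bins`.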

Acknowledgements

The authors thank CONACyT for its support through grant 165151, awarded to the first author of this paper, and through project grants CB2008-106443 and CB2008-106366.

Note

¹ Extended abstract of PhD thesis. Graduated: Anilu Franco-Arcega. Advisors: Jesús Ariel Carrasco-Ochoa, Guillermo Sánchez-Díaz, and José Francisco Martínez-Trinidad. Graduation date: 14/07/2010.