ISSN 1405-5546

Article ID: S1405-55462013000100010

México, March 2013. Vol. 17, No. 1, pp. 95-102

Doctoral thesis summary

Decision Tree based Classifiers for Large Datasets

Anilu Franco-Arcega¹,², Jesús Ariel Carrasco-Ochoa², Guillermo Sánchez-Díaz³, and José Francisco Martínez-Trinidad²

¹ Universidad Autónoma del Estado de Hidalgo, Hidalgo, México.

² Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, México.

³ Universidad Autónoma de San Luis Potosí, San Luis Potosí, México.

afranco@uaeh.edu.mx, anifranco6@inaoep.mx, ariel@inaoep.mx, fmartine@inaoep.mx, guillermo.sanchez@uaslp.mx

Article received on 21/09/2011.
Accepted on 25/09/2011.

Abstract

This paper presents four algorithms for building decision trees from large datasets. These algorithms overcome some restrictions of the most recent state-of-the-art algorithms. Three of them are designed to process datasets described exclusively by numeric attributes, and the fourth processes mixed datasets. The proposed algorithms process all the training instances without storing the whole dataset in main memory. In addition, they are faster than the most recent algorithms for building decision trees from large datasets and reach competitive accuracy rates.

Keywords: Decision trees, supervised classification, large datasets.

Acknowledgements

The authors thank CONACyT for its support through grant 165151, awarded to the first author of this paper, and through project grants CB2008-106443 and CB2008-106366.

Note

¹ Extended abstract of PhD thesis. Graduate: Anilu Franco-Arcega. Advisors: Jesús Ariel Carrasco-Ochoa, Guillermo Sánchez-Díaz, and José Francisco Martínez-Trinidad. Graduation date: 14/07/2010.
