Reactive Scheduling of DAG Applications on Heterogeneous and Dynamic Distributed Computing Systems

Hernández Hernández, Jesús Israel

Computación y Sistemas

On-line version ISSN 2007-9737

Print version ISSN 1405-5546

Comp. y Sist. vol.13 n.2 Ciudad de México Oct./Dec. 2009

Doctoral thesis summary

Mapeo de Aplicaciones Paralelas tipo DAG en Sistemas Distribuidos Heterogéneos y Dinámicos

Graduate: Jesús Israel Hernández Hernández
Institute for Computing Systems Architecture
School of Informatics
University of Edinburgh, UK.
j.i.hernandez@sms.ed.ac.uk

Supervisor: Murray Cole
Institute for Computing Systems Architecture
School of Informatics
University of Edinburgh, UK.
mic@inf.ed.ac.uk

Graduated on December 4th, 2008

Abstract

Emerging computational platforms enable a set of geographically distributed computers with different capabilities to be linked together and used in a coordinated fashion to run a single parallel application. Effective scheduling mechanisms are essential to exploit the tremendous potential of the computational resources offered by such platforms. We consider the problem of scheduling parallel applications that are often abstracted as directed acyclic graphs (DAGs), in which vertices represent application tasks and edges represent data dependencies between tasks. The core scheduling issue is that the availability and performance of resources, which are heterogeneous by nature, can be expected to vary dynamically, even during the course of an execution. This thesis summary presents the main results of the Global Task Positioning (GTP) mapping method, which is based on the cyclic use of a static mapping method over time. We place strong emphasis on three key aspects, which we believe are central to addressing the dynamic nature of the problem: reactivity, data-aware components and fault tolerance.

Keywords: Parallel processing, heterogeneous computing, task scheduling, DAG scheduling, fault tolerance.
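
As a rough illustration of the model described in the abstract, the following Python sketch represents an application as a DAG of tasks, maps it onto processors of different speeds with a simple static heuristic, and then re-runs the same static mapper on the unfinished tasks after a processor slows down. It is not the GTP method itself: the greedy earliest-finish-time rule, the fixed communication delay `comm`, and all names (`static_map`, `work`, `deps`, `speeds`, `fixed`, `now`) are assumptions made for illustration only.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def static_map(work, deps, speeds, comm=1.0, fixed=None, now=0.0):
    """Greedy static mapping (illustrative assumption, not the thesis algorithm).

    work   : task -> amount of computation
    deps   : task -> tuple of predecessor tasks (the DAG edges)
    speeds : processor -> current speed (work units per time unit)
    comm   : delay added when a predecessor ran on a different processor
    fixed  : task -> (processor, finish_time) for tasks already completed
    now    : time at which this mapping round starts
    Returns task -> (processor, finish_time).
    """
    placed = dict(fixed or {})
    free_at = {proc: now for proc in speeds}  # when each processor is idle
    for task in TopologicalSorter(deps).static_order():
        if task in placed:
            continue  # completed earlier; keep its placement
        best = None
        for proc, speed in speeds.items():
            # Input data is ready once every predecessor has finished,
            # plus a transfer delay if it ran on another processor.
            ready = max(
                (placed[d][1] + (0.0 if placed[d][0] == proc else comm)
                 for d in deps[task]),
                default=now,
            )
            finish = max(ready, free_at[proc]) + work[task] / speed
            if best is None or finish < best[1]:
                best = (proc, finish)
        placed[task] = best
        free_at[best[0]] = best[1]
    return placed


# Toy diamond-shaped DAG: A feeds B and C, which both feed D.
work = {"A": 4, "B": 6, "C": 2, "D": 4}
deps = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C")}

# Initial static mapping on two processors of different speeds.
plan = static_map(work, deps, {"fast": 2.0, "slow": 1.0})

# Reactive step: at time 2.0 task A is done and "fast" has slowed down,
# so the remaining tasks are mapped again with the updated speeds.
replan = static_map(
    work, deps, {"fast": 0.5, "slow": 1.0},
    fixed={"A": plan["A"]}, now=2.0,
)

print(plan)
print(replan)
```

In this toy run the second call keeps the placement of the completed task and moves most of the remaining work to the processor that is now faster. A real reactive scheduler would also have to account for tasks that are already running, data that has already been transferred, and resource failures, which this sketch deliberately ignores.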

Summary

Emerging computational platforms allow computational resources that are connected by a high-speed network and located at geographically distributed sites to be shared in order to solve an application concurrently. In this context, task assignment mechanisms become essential to exploit the tremendous potential of these computational resources. Our research considers the problem of mapping parallel applications, frequently represented as directed acyclic graphs (DAGs), onto distributed, heterogeneous and dynamic computing environments. The central issue is that the availability and performance of the computational resources can vary over time, even before the application finishes executing. We place special emphasis on three key aspects, which we believe are essential for dealing with the dynamic nature of the problem: adaptability, reuse of information and fault tolerance. This thesis summary shares the experience gained in the area and presents the main results of the GTP (Global Task Positioning) method for mapping parallel applications, together with its variants.

Keywords: Parallel computing, heterogeneous computing, task mapping, fault tolerance.

