Student: Lucas Leandro Nesi
Advisor: Prof. Dr. Lucas Mello Schnorr
Title: Strategies for Distributing Task-Based Applications on Heterogeneous Platforms
Research Area: Parallel and Distributed Processing
This defense will exceptionally take place fully remotely. Those interested in attending can access the virtual room through the link: https://mconf.ufrgs.br/webconf/sala205
– Prof. Dr. Alfredo Goldman Vel Lejbman (USP)
– Prof. Dr. Márcio Bastos Castro (UFSC)
– Prof. Dr. Philippe Olivier Alexandre Navaux (UFRGS)
Committee Chair: Prof. Dr. Lucas Mello Schnorr
Abstract: Heterogeneity is present in any HPC platform, either within a node, through resources such as accelerators, or across nodes at the system level, when the machines differ. This resource diversity appears both in supercomputers, with multiple partitions, and in the Cloud. To cope with system-level heterogeneity, applications require special distributions that divide their load among the resources. One paradigm that is gaining popularity for handling complex systems on large platforms is the task-based one, which reduces the programmer's burden and makes decisions dynamic. Moreover, task-based applications present the flexibility needed to handle these heterogeneous distributions. This thesis proposal studies the problem of distributing task-based applications on heterogeneous resources, considering application characteristics and communication, investigating several challenges, and proposing strategies to solve them. One challenge is improving an already asymptotically optimal data distribution for the LU factorization. This thesis proposal presents two methods: first, limiting the number of nodes used at the end of the computation, considering communication and the DAG; second, moving blocks to improve inter-iteration balance. The combination of these methods further improves the balance and communication of the already asymptotically optimal distribution. Moreover, applications may have multiple phases (algorithmic operations) with distinct resource needs, and can take advantage of this inter-node heterogeneity to enhance performance and reduce resource idleness. This thesis proposal introduces strategies to efficiently distribute asynchronous multi-phase applications on system-level heterogeneous resources using multiple distributions. First, it offers strategies to improve application phase overlap, gaining up to 50% in performance; second, it computes a distribution for all phases using a linear program that leverages node heterogeneity while limiting communication overhead.
This thesis proposal shows that adding some slow nodes to a homogeneous set of fast nodes can improve performance by another 25%, harnessing any machine. The next challenge, left as future work, is identifying the best set of nodes to use, since unlimited resources do not yield the best performance or cost. This thesis proposal plans to use reinforcement learning methods, k-bandits, and Gaussian processes to learn the best set of nodes for each application phase during execution. Further challenges discussed are experiments with more applications that have other particularities and scalability experiments on supercomputers, including SDumont.
Keywords: HPC, Heterogeneity, Task-Based, Distribution, Partitioning, Multi-Phase.