
PhD Thesis Proposal of Jean Luca Bez

Event Details


Student: Jean Luca Bez
Advisor: Prof. Dr. Philippe Olivier Alexandre Navaux
Co-advisor: Prof. Dr. Antonio Cortes Rosseló

Title: Dynamic Tuning and Reconfiguration of the I/O Forwarding Layer in HPC Platforms
Research Line: Parallel and Distributed Processing

Date: 05/08/2020 (August 5, 2020)
Time: 10:00 a.m.

Exceptionally, this session will be held entirely remotely. Anyone interested in attending the defense can join the virtual room through the link: http://meet.google.com/dyw-aqth-adz

Examination Committee:

– Prof. Dr. Carla Osthoff Ferreira de Barros (LNCC)
– Prof. Dr. Lucas Mello Schnorr (UFRGS)
– Prof. Dr. Luciano Paschoal Gaspary (UFRGS) 

Committee Chair: Prof. Dr. Philippe Olivier Alexandre Navaux

Abstract: Input and output (I/O) operations are a bottleneck for an increasing number of applications running on High-Performance Computing (HPC) platforms, and this problem could critically impact performance on the next generation of supercomputers. I/O optimization techniques can improve performance for specific system configurations and application access patterns, but not for all of them. We define the access pattern as the way an application performs its I/O operations. These techniques frequently rely on the precise tuning of parameters, a task that commonly falls to the users. Moreover, such large-scale systems run an ever-changing set of applications with distinct characteristics and demands, so improving performance requires adapting the system dynamically to the changing workload. In this work, we seek to guide optimization and tuning strategies by identifying the applications' I/O access patterns. We evaluate three machine learning techniques to automatically detect the I/O access pattern of HPC applications at runtime: decision trees, random forests, and neural networks. We focus on detection using metrics from file-level accesses, as seen by the clients, the I/O nodes, and the parallel file system servers. We evaluated these detection strategies in a case study in which accurately detecting the current access pattern is fundamental to adjusting a parameter of an I/O scheduling algorithm. Using the detected access pattern, we propose a tuning strategy based on a reinforcement learning technique (contextual bandits) that lets the system learn, during its execution, the best parameter value for each observed access pattern. This eliminates the need for a complicated and time-consuming prior training phase. We evaluate our proposal and demonstrate that it can reach a precision of 88% in parameter selection within the first hundreds of observations of an access pattern, achieving 99% of the optimal performance.
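The abstract does not specify which contextual-bandit algorithm drives the tuning. As a minimal sketch, assuming discrete contexts (the detected access patterns), a small set of candidate values for the scheduler parameter, and measured bandwidth as the reward, a per-pattern epsilon-greedy bandit could look like the following. Epsilon-greedy is swapped in here only for illustration; the class name, pattern labels, parameter values, and simulated bandwidth table are all hypothetical.

```python
import random
from collections import defaultdict


class EpsilonGreedyContextualBandit:
    """Hypothetical sketch: one epsilon-greedy bandit per discrete context.

    Contexts stand for detected access patterns, arms for candidate values of
    a scheduler parameter, and rewards for observed bandwidth.
    """

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # Number of observations and running mean reward per (context, arm).
        self.counts = defaultdict(lambda: {a: 0 for a in self.arms})
        self.values = defaultdict(lambda: {a: 0.0 for a in self.arms})

    def select(self, context):
        """Explore with probability epsilon, otherwise exploit the best estimate."""
        counts = self.counts[context]
        # Try every arm once before trusting the estimates.
        for arm in self.arms:
            if counts[arm] == 0:
                return arm
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return self.best(context)

    def update(self, context, arm, reward):
        """Incrementally update the mean reward of (context, arm)."""
        self.counts[context][arm] += 1
        n = self.counts[context][arm]
        self.values[context][arm] += (reward - self.values[context][arm]) / n

    def best(self, context):
        """Arm with the highest estimated reward for this context."""
        values = self.values[context]
        return max(self.arms, key=lambda a: values[a])


# Hypothetical bandwidth (MB/s) for each (access pattern, parameter value) pair.
SIMULATED_BANDWIDTH = {
    "contiguous-large": {1: 900.0, 8: 1200.0, 64: 1000.0},
    "strided-small": {1: 300.0, 8: 250.0, 64: 150.0},
}

bandit = EpsilonGreedyContextualBandit(arms=[1, 8, 64], epsilon=0.1, seed=42)
rng = random.Random(7)
for _ in range(500):
    # Patterns arrive interleaved, so observations of one pattern are not contiguous.
    pattern = rng.choice(list(SIMULATED_BANDWIDTH))
    value = bandit.select(pattern)
    bandit.update(pattern, value, SIMULATED_BANDWIDTH[pattern][value])

for pattern in SIMULATED_BANDWIDTH:
    print(pattern, bandit.best(pattern))
```

Because the learner keeps separate statistics per access pattern, it needs no offline training: it starts choosing the best value for a pattern as soon as it has observed that pattern under each candidate value, even when those observations are spread over time.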
We also demonstrate that the system can adapt to changes and optimize its performance after observing a pattern for a few (not necessarily contiguous) minutes. Finally, the standard approach is to statically assign I/O forwarding nodes to applications according to the number of processing nodes they use, which is not necessarily related to their I/O requirements. This strategy leads to inefficient usage of these resources. We propose an optimal policy to allocate the available forwarders to applications according to their access patterns and I/O demands. Our allocation policy aims to maximize global bandwidth by giving more I/O nodes to the applications that will benefit the most. Initial results show that our dynamic allocation policy, based on the Multiple-Choice Knapsack Problem (MCKP), can improve global I/O bandwidth by up to 23× compared to the existing static option.
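To illustrate how a Multiple-Choice Knapsack formulation maps onto forwarder allocation, here is a minimal sketch. It assumes, hypothetically, that for each application we already have an estimate of the bandwidth it would reach with each possible number of forwarders (the abstract does not say how such estimates are obtained). Each application is a group, exactly one option is picked per group, and the total number of forwarders is the knapsack capacity; the sketch solves it with a textbook dynamic program. The function name and the bandwidth table are hypothetical.

```python
from typing import Dict, List, Tuple


def mckp_allocate(options: List[Dict[int, float]], capacity: int) -> Tuple[float, List[int]]:
    """Pick exactly one forwarder count per application (Multiple-Choice Knapsack).

    options[i] maps a number of I/O forwarders k to the (hypothetical) expected
    bandwidth of application i when it receives k forwarders. Each application
    should offer k = 0 so that a feasible allocation always exists.
    Returns the best total bandwidth and the forwarder count chosen per application.
    """
    n = len(options)
    neg_inf = float("-inf")
    # dp[i][c]: best total bandwidth of the first i applications using at most c forwarders.
    dp = [[0.0] * (capacity + 1)] + [[neg_inf] * (capacity + 1) for _ in range(n)]
    # choice[i][c]: forwarder count given to application i in the optimum behind dp[i + 1][c].
    choice = [[0] * (capacity + 1) for _ in range(n)]
    for i, app in enumerate(options):
        for c in range(capacity + 1):
            for k, bw in app.items():
                if k <= c and dp[i][c - k] + bw > dp[i + 1][c]:
                    dp[i + 1][c] = dp[i][c - k] + bw
                    choice[i][c] = k
    if dp[n][capacity] == neg_inf:
        raise ValueError("no feasible allocation; add a 0-forwarder option to every application")
    # Walk back through the table to recover each application's share.
    allocation = [0] * n
    c = capacity
    for i in range(n - 1, -1, -1):
        allocation[i] = choice[i][c]
        c -= allocation[i]
    return dp[n][capacity], allocation


# Hypothetical expected bandwidth (MB/s) per number of forwarders, for three applications.
apps = [
    {0: 100.0, 1: 400.0, 2: 700.0, 4: 900.0},  # app A benefits a lot from forwarding
    {0: 200.0, 1: 250.0, 2: 260.0},            # app B barely benefits
    {0: 50.0, 1: 300.0, 2: 320.0},             # app C saturates after one forwarder
]
total, allocation = mckp_allocate(apps, capacity=4)
print(total, allocation)
```

With four forwarders, the sketch gives two to app A and one each to apps B and C (1250 MB/s in total) rather than handing all four to app A (1150 MB/s), which shows the idea of giving forwarders to the applications that benefit most instead of sizing the allocation by node count.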

Keywords: high-performance computing, parallel I/O, I/O forwarding, access pattern detection, allocation policy.