Thesis Proposal in Parallel and Distributed Processing

UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL

INSTITUTO DE INFORMÁTICA

PROGRAMA DE PÓS-GRADUAÇÃO EM COMPUTAÇÃO

----------------------------------------

THESIS PROPOSAL DEFENSE

Student: Matthias Diener

Advisor: Prof. Dr. Philippe Olivier Alexandre Navaux

Title: Automatic Thread and Data Mapping in Shared-Memory Architectures

Research Area: Parallel and Distributed Processing

Date: 24 July 2014

Time: 10:30

Location: Building 43424, Prof. Castilho Auditorium, Instituto de Informática

Examination Committee:

Prof. Dr. Alexandre da Silva Carissimi (UFRGS)

Prof. Dr. Flávio Rech Wagner (UFRGS)

Prof. Dr. Rodolfo Jardim de Azevedo (UNICAMP)

Committee Chair: Prof. Dr. Philippe Olivier Alexandre Navaux

Abstract:

Modern parallel architectures have complex memory hierarchies. These consist of several levels of private and shared caches and also show Non-Uniform Memory Access (NUMA) behavior, because each system has multiple memory controllers. A major challenge in these architectures is to improve the locality of memory accesses so that the overall memory access latency is reduced, since this can improve both the performance and the energy efficiency of parallel applications. Locality can be improved in two ways:

(1) Threads and processes that access shared data (that is, that communicate) can be mapped to execution units that are close to each other in the memory hierarchy, which improves cache usage. We refer to this technique as thread mapping.

(2) The memory pages that each thread or process accesses can be mapped to the NUMA node it is executing on, which reduces accesses to remote memory in NUMA architectures. We call this technique data mapping.

For optimal results, thread and data mapping need to be performed in an integrated way. Previous work in this area performs the two mappings only separately, which limits the achievable gains. Furthermore, most previous mechanisms have at least one of three drawbacks. They may require expensive operations to perform the mapping, such as collecting communication or memory access traces. They may require changes to the hardware or to the parallel application. Or they may use a simple static mapping. Such mechanisms cannot be considered generic solutions to the mapping problem.

In this thesis, we make two contributions to the mapping problem. First, we introduce a set of metrics and a methodology for analyzing parallel applications. These determine whether an application is suitable for an improved mapping and evaluate the gains that an optimized mapping can achieve. Second, we propose two mechanisms: one performs online thread mapping, and the other performs online thread and data mapping. Both mechanisms work at the operating system level and require no changes to the hardware, to the applications themselves, or to their runtime libraries. An extensive evaluation used parallel benchmarks from four benchmark suites. It shows performance improvements of up to 35.4% and energy efficiency improvements of up to 34.6%, with an average overhead of only 1.8%.
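The abstract describes thread mapping and data mapping only at a high level. The Python sketch below illustrates both ideas on a toy example. It is not the mechanism proposed in the thesis. That mechanism works online in the operating system without traces, whereas this sketch takes an offline communication matrix and per-page access counts as input. Several details are hypothetical assumptions chosen only for illustration: the greedy pairing heuristic, the function names (`pair_threads`, `map_threads`, `map_pages`), and the topology, in which cores 2k and 2k+1 share a cache and each NUMA node holds two consecutive cores.

```python
from typing import Dict, List, Tuple

# Hypothetical topology (an assumption, not from the thesis): cores 2k and
# 2k+1 share a cache, and each NUMA node holds CORES_PER_NODE consecutive cores.
CORES_PER_NODE = 2


def pair_threads(comm: List[List[int]]) -> List[Tuple[int, int]]:
    """Greedily pair the threads that communicate the most.

    comm[i][j] is the amount of communication between threads i and j,
    assumed symmetric. Ties are broken in favor of the lowest thread IDs.
    """
    n = len(comm)
    edges = sorted(
        ((comm[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda e: (-e[0], e[1], e[2]),
    )
    used = set()
    pairs = []
    for _, i, j in edges:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs


def map_threads(comm: List[List[int]]) -> Dict[int, int]:
    """Thread mapping: place each communicating pair on cores sharing a cache."""
    order = [t for pair in pair_threads(comm) for t in pair]
    # With an odd number of threads, the unpaired thread takes the next free core.
    order += [t for t in range(len(comm)) if t not in order]
    return {t: core for core, t in enumerate(order)}


def map_pages(
    page_accesses: Dict[int, Dict[int, int]],
    thread_to_core: Dict[int, int],
) -> Dict[int, int]:
    """Data mapping: place each page on the NUMA node that accesses it most."""
    placement = {}
    for page, per_thread in page_accesses.items():
        per_node: Dict[int, int] = {}
        for thread, count in per_thread.items():
            node = thread_to_core[thread] // CORES_PER_NODE
            per_node[node] = per_node.get(node, 0) + count
        # max() keeps the first maximum, so ties go to the lowest node ID.
        placement[page] = max(sorted(per_node), key=per_node.__getitem__)
    return placement


# Toy input: threads 0/2 and 1/3 communicate heavily.
comm = [
    [0, 1, 9, 0],
    [1, 0, 0, 7],
    [9, 0, 0, 2],
    [0, 7, 2, 0],
]
threads = map_threads(comm)
pages = map_pages({0x1000: {0: 50, 1: 3}, 0x2000: {1: 10, 2: 5, 3: 40}}, threads)
print(threads)  # {0: 0, 2: 1, 1: 2, 3: 3}
print(pages)    # {4096: 0, 8192: 1}
```

Pairing threads fits only a topology in which each cache is shared by exactly two cores. Deeper hierarchies would require grouping threads recursively. The data mapping step depends on the thread mapping, which reflects the abstract's point that the two should be decided together.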

Keywords: Thread mapping, Data mapping, Shared memory, Multicore, NUMA