UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL
INSTITUTO DE INFORMÁTICA
PROGRAMA DE PÓS-GRADUAÇÃO EM COMPUTAÇÃO
----------------------------------------
Student: Matthias Diener
Advisor: Prof. Dr. Philippe Olivier Alexandre Navaux
Title: Automatic Thread and Data Mapping in Shared-Memory Architectures
Research Area: Parallel and Distributed Processing
Date: 24/07/2014
Location: Building 43424 – Auditório Prof. Castilho, Instituto de Informática
Examination Committee:
Prof. Dr. Alexandre da Silva Carissimi (UFRGS)
Prof. Dr. Flávio Rech Wagner (UFRGS)
Prof. Dr. Rodolfo Jardim de Azevedo (UNICAMP)
Abstract:
Modern parallel architectures have complex memory hierarchies, which consist of several levels of private and shared caches, as well as Non-Uniform Memory Access (NUMA) behavior due to multiple memory controllers per system. A major challenge in these architectures is to improve the locality of memory accesses in such a way that the overall memory access latency is reduced, as this can improve both the performance and the energy efficiency of parallel applications. The locality can be improved in two ways: (1) Map threads and processes that access shared data (communicate) to execution units that are close to each other in the memory hierarchy in order to improve the usage of caches. We refer to this technique as thread mapping. (2) Map the memory pages that each thread or process accesses to the NUMA node that it is executing on, in order to reduce accesses to remote memories in NUMA architectures. We call this technique data mapping. For optimal results, thread and data mapping need to be performed in an integrated way. Previous work in this area performs the two mappings only separately, which limits the gains that can be achieved. Furthermore, most previous mechanisms require expensive operations, such as communication or memory access traces, to perform the mapping, require changes to the hardware or to the parallel application, or use a simple static mapping. These mechanisms cannot be considered generic solutions for the mapping problem. In this thesis, we make two contributions to the mapping problem. First, we introduce a set of metrics and a methodology to analyze parallel applications in order to determine their suitability for an improved mapping and to evaluate the possible gains that can be achieved using an optimized mapping. Second, we propose two mechanisms that perform online thread mapping and online thread/data mapping, respectively. These mechanisms work on the operating system level and require no changes to the hardware, the applications themselves, or their runtime libraries. An extensive evaluation with parallel benchmarks from four benchmark suites shows performance and energy efficiency improvements of up to 35.4% and 34.6%, respectively, with an average overhead of only 1.8%.
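The two mapping techniques named in the abstract can be illustrated with a small offline sketch. It is not the mechanism proposed in the thesis, which works online at the operating system level; it only assumes that a communication matrix and per-page access counts are already known. The function names `greedy_thread_pairs` and `map_pages` and the sample numbers are hypothetical, and the greedy pairing is a simple stand-in heuristic.

```python
from typing import List, Tuple


def greedy_thread_pairs(comm: List[List[int]]) -> List[Tuple[int, int]]:
    """Thread mapping sketch: pair the threads that communicate most.

    comm[i][j] is an assumed count of communication events between
    threads i and j. Each returned pair would be placed on execution
    units that share a cache.
    """
    n = len(comm)
    # All candidate pairs, heaviest communication first (ties by thread id).
    candidates = sorted(
        ((comm[i][j] + comm[j][i], i, j)
         for i in range(n) for j in range(i + 1, n)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    paired = set()
    pairs = []
    for _volume, i, j in candidates:
        if i not in paired and j not in paired:
            pairs.append((i, j))
            paired.update((i, j))
    return pairs


def map_pages(page_accesses: List[List[int]],
              thread_node: List[int],
              num_nodes: int) -> List[int]:
    """Data mapping sketch: place each page on the NUMA node that accesses it most.

    page_accesses[p][t] is an assumed access count of thread t to page p;
    thread_node[t] is the NUMA node thread t runs on.
    """
    placement = []
    for per_thread in page_accesses:
        per_node = [0] * num_nodes
        for thread, count in enumerate(per_thread):
            per_node[thread_node[thread]] += count
        # On a tie, the lowest-numbered node wins.
        placement.append(per_node.index(max(per_node)))
    return placement


# Hypothetical input: threads 0/2 and 1/3 communicate heavily.
comm = [
    [0, 1, 9, 0],
    [1, 0, 0, 8],
    [9, 0, 0, 2],
    [0, 8, 2, 0],
]
pairs = greedy_thread_pairs(comm)

# Put each pair on its own NUMA node, then place pages accordingly.
thread_node = [0] * len(comm)
for node, (a, b) in enumerate(pairs):
    thread_node[a] = thread_node[b] = node

page_accesses = [
    [5, 0, 4, 1],
    [0, 7, 1, 6],
    [3, 3, 0, 0],
]
placement = map_pages(page_accesses, thread_node, num_nodes=len(pairs))
```

In this sketch the data mapping consumes the result of the thread mapping, which mirrors the abstract's point that the two decisions depend on each other and should be made in an integrated way.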
Keywords: Thread mapping, Data mapping, Shared memory, Multicore, NUMA