
PhD Thesis Defense of Kassiano José Matteussi

Event Details

Student: Kassiano José Matteussi

Advisor: Prof. Dr. Claudio Fernando Resin Geyer

Title: Cache-Based Global Memory Orchestration For Data-Intensive Stream Processing Pipelines

Research Line: High-Performance Computing and Distributed Systems


Date: 24/06/2022

Time: 9 a.m.

This defense will exceptionally be held entirely online. Those interested in attending may access the virtual room through the link: https://mconf.ufrgs.br/webconf/00000976


Examination Committee:

– Prof. Dr. Luiz Gustavo Leão Fernandes (PUCRS)
– Prof. Dr. Jorge Luis Victoria Barbosa (UNISINOS)
– Prof. Dr. Edison Pignaton de Freitas (UFRGS)


Committee Chair: Prof. Dr. Claudio Fernando Resin Geyer


Abstract: A significant rise in the adoption of streaming applications has changed decision-making processes over the last decade. This movement led to the emergence of several Big Data technologies for in-memory processing, such as Apache Storm, Spark, Heron, Samza, and Flink, which provide data-intensive in-memory processing across varied areas and domains, including financial services, healthcare, education, manufacturing, retail, social media, and sensor networks. Streaming systems use the Java Virtual Machine (JVM) as the underlying processing environment for platform independence. Although the JVM provides a high-level hardware abstraction, it cannot efficiently manage applications that cache data intensively into the heap. As a result, it may cause data loss, throughput degradation, and high latency due to processing overheads induced by data deserialization, object scattering in main memory, garbage collection, and others. The state of the art reinforces that efficient memory management for Stream Processing (SP) plays a prominent role in real-time data analysis, since it is a critical task for performance. Existing solutions have provided strategies for optimizing the shuffle-driven eviction process, job-level caching, and GC-performance-based cache allocation models on top of Apache Spark and Flink. However, previous studies cannot control the JVM state, because they rely on embedded mechanisms that are unaware of processing and storage utilization as a whole. This thesis considers the impact of overall cache utilization on SP to propose a cache-based global memory orchestration model with well-defined memory utilization policies. It aims to improve memory management of data-intensive SP pipelines, avoid memory-based performance issues, and keep application throughput stable.
The evaluation comprises real experiments on small and medium-sized data center infrastructures with fast network switches. The experiments use Apache Spark and real-world streaming applications with representative in-memory execution and storage utilization (e.g., data cache operations, stateful processing, and checkpointing). The results revealed that the proposed solution kept throughput stable at a high rate (e.g., ~1 GB/s for small and medium-sized clusters) and may reduce global JVM heap memory utilization by up to 50% in the evaluated cases.
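To give a flavor of the idea discussed in the abstract, the sketch below shows one possible shape of a cache orchestrator that enforces a memory utilization policy: cached blocks are tracked globally, and least-recently-used blocks are evicted once utilization crosses a high-water mark. This is a minimal illustrative sketch only; the class name, the LRU policy, and the 80% watermark are assumptions for illustration, not the mechanism proposed in the thesis.

```python
from collections import OrderedDict

class CacheOrchestrator:
    """Illustrative sketch (hypothetical): tracks cached blocks globally
    and evicts least-recently-used entries whenever cache utilization
    exceeds a configurable high-water mark."""

    def __init__(self, capacity_bytes: int, high_watermark: float = 0.8):
        self.capacity = capacity_bytes
        self.high_watermark = high_watermark
        self.blocks: OrderedDict[str, int] = OrderedDict()  # block_id -> size
        self.used = 0

    def put(self, block_id: str, size: int) -> None:
        # Replace an existing block's accounting before re-inserting it.
        if block_id in self.blocks:
            self.used -= self.blocks.pop(block_id)
        self.blocks[block_id] = size
        self.used += size
        self._enforce_policy()

    def get(self, block_id: str) -> bool:
        if block_id not in self.blocks:
            return False
        self.blocks.move_to_end(block_id)  # mark as recently used
        return True

    def _enforce_policy(self) -> None:
        # Evict LRU blocks until utilization drops to the watermark or below.
        while self.used > self.capacity * self.high_watermark and self.blocks:
            _, size = self.blocks.popitem(last=False)
            self.used -= size
```

With a 100-byte capacity and the default 0.8 watermark, caching three 40-byte blocks evicts the oldest one, keeping utilization at or below 80 bytes. A real orchestrator would, of course, observe JVM heap and GC metrics rather than a simple byte counter.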

Keywords: Data-Intensive Stream Processing. Streaming Applications. Real-Time Big Data Analytics. Apache Spark. Memory Management. Data Orchestration.