Master's Dissertation of Lucas Barbosa Castanheira
MASTER'S DISSERTATION DEFENSE
Student: Lucas Barbosa Castanheira
Advisor: Prof. Dr. Alberto Egon Schaeffer Filho
Title: Tracing and Troubleshooting In-Network Computation
Research Area: Architectures, Protocols, and Management of Networks and Services.
Location: Room 215, Building 43412, Instituto de Informática, and virtual room at the link: https://cmu.zoom.us/j/3243759682
– Prof. Dr. Christian Rodolfo Esteve Rothenberg (UNICAMP)
– Prof. Dr. Weverton Luis da Costa Cordeiro (UFRGS)
– Prof. Dr. Luciano Paschoal Gaspary (UFRGS)
Committee Chair: Prof. Dr. Alberto Egon Schaeffer Filho
Abstract: There is a growing move to offload functionality, e.g., TCP or key-value stores, into the network, either on SmartNICs or programmable data planes. While offloading promises significant performance boosts, these programmable devices often provide little visibility into their performance. Moreover, many existing tools for analyzing and debugging performance problems, e.g., distributed tracing, do not extend into these devices. Motivated by this lack of visibility, the first half of this work presents the design and implementation of Foxhound, an observability framework for in-network compute. This framework introduces a co-designed query language, compiler, and storage abstraction layer for expressing, capturing, and analyzing distributed traces and their performance data across an infrastructure comprising servers and programmable data planes. While Foxhound is our proof of concept for flexible in-network tracing, we discovered that the traditional tracing paradigm that Foxhound embodies can suffer from scalability issues given the hardware limitations of programmable data planes. In our effort to mitigate this, we identified a subset of common tracing queries that could be optimized beyond Foxhound’s capabilities. These optimizations represent a departure from traditional tracing and constitute another framework, Mimir, presented in the latter half of this work. Mimir trades off flexibility for efficiency by exploring a set of design choices that optimize for common diagnosis and localization tasks. Our evaluations using three representative offloaded applications on an Intel Tofino-based testbed, an emulator, and a simulator show that Mimir can support a subset of common tracing tasks at scale with significantly lower overheads than Foxhound. Moreover, our experiments with an in-network-compute-enhanced DeathStarBench “social network” microservice demonstrate the usefulness of our approach for end-to-end diagnosis.
Keywords: In-Network Compute, Telemetry, Debugging.