
Defense – Dissertation of Augusto Exenberger Becker


Event Details


Student: Augusto Exenberger Becker
Advisor: Mariana Recamonde Mendoza Guerreiro

Title: A Systematic Study of the Impact of Data Leakage in Machine Learning Pipelines for Classification
Research Line: Machine Learning, Knowledge Representation and Reasoning

Date: 05/02/2026
Time: 09:00
Location: The defense will be held in hybrid format (in person and online), in room 215/43412 of the Instituto de Informática/UFRGS and via the link https://mconf.ufrgs.br/webconf/00195037.

Examination Committee:
- Ana Carolina Lorena (ITA)
- Adriano Velasque Werhli (FURG)
- Dennis Giovani Balreira (UFRGS)

Committee Chair: Mariana Recamonde Mendoza Guerreiro

Abstract: Data leakage refers to the inadvertent contamination of training data with information from validation or test sets, leading to overoptimistic performance estimates and poor reproducibility of machine learning (ML) results. Although the general concept of data leakage is well known, its impact can manifest in subtle ways across different stages of the ML pipeline, particularly during data preparation. This dissertation presents a systematic experimental evaluation of the impact of data leakage introduced in different data preparation tasks, including normalization, value imputation, feature selection, hyperparameter tuning, and class balancing, as well as a worst-case scenario in which leakage is introduced simultaneously across all preparation tasks. The experimental study was conducted on 42 structured datasets from the Penn Machine Learning Benchmark, selected to ensure diversity in dataset characteristics. For each dataset, average hardness measures were computed to investigate their relationship with the magnitude of the performance inflation caused by data leakage. Models were trained with eight widely used classification algorithms under a controlled paired experimental design, differing only in the presence or absence of data leakage, with the F1-score adopted as the primary evaluation metric. The results reveal a consistent tendency of data leakage to inflate performance estimates, with substantial variability across datasets and leakage sources. Average F1-score increases reached up to 14% for specific datasets, and 35 of the 42 datasets exhibited positive average performance gains. Among the individual preparation tasks, class balancing with SMOTE produced the largest average impact, with an average F1-score increase of 3.97%, followed by feature selection and hyperparameter tuning, with average increases of 0.54% and 0.43%, respectively, while normalization and value imputation showed no substantial effects. The worst-case scenario revealed markedly larger impacts, with average performance increases of 65%, 16%, and 10% for specific datasets. In contrast, variation across learning algorithms produced comparatively smaller differences in performance inflation, although all algorithms exhibited statistically significant effects of data leakage. Overall, these findings demonstrate that data leakage can substantially distort empirical evaluations in ML. Beyond quantifying these effects, this work derives a set of evidence-based guidelines for mitigating data leakage, prioritizing high-risk pipeline stages and promoting leakage-aware experimental protocols to improve the reliability and reproducibility of ML results.
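To make the leakage mechanism concrete, the minimal sketch below (not taken from the dissertation) contrasts a leaky setup, where normalization and feature selection are fit on the full dataset before cross-validation, with a leakage-free setup that re-fits the same steps on the training portion of each fold via a scikit-learn Pipeline. The synthetic dataset, the chosen estimator, and all parameter values are illustrative assumptions.

# A minimal sketch, assuming scikit-learn and synthetic data; illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Many noisy features, few informative ones: a setting where leakage shows up.
X, y = make_classification(n_samples=300, n_features=200, n_informative=5,
                           random_state=0)

# LEAKY: scaling and feature selection are fit on ALL samples before
# cross-validation, so information from the test folds contaminates
# the fitted preprocessing.
X_leaky = StandardScaler().fit_transform(X)
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X_leaky, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y,
                        cv=5, scoring="f1")

# LEAKAGE-FREE: the same steps inside a Pipeline are re-fit on the
# training data of each fold only.
clean_pipe = make_pipeline(StandardScaler(),
                           SelectKBest(f_classif, k=10),
                           LogisticRegression(max_iter=1000))
clean = cross_val_score(clean_pipe, X, y, cv=5, scoring="f1")

print(f"leaky F1:        {leaky.mean():.3f}")
print(f"leakage-free F1: {clean.mean():.3f}")

With many noise features and relatively few samples, the leaky variant typically reports a noticeably higher F1-score, mirroring the performance inflation the dissertation quantifies.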

Keywords: Data Leakage. Data Contamination. Overfitting. Performance Estimate. Machine Learning.