
Defense: Dissertation of Diego Dimer Rodrigues


Event Details


Student: Diego Dimer Rodrigues
Advisor: Mariana Recamonde Mendoza Guerreiro

Title: Systematic assessment of data-level bias and its impact on machine learning classifiers
Research Area: Machine Learning, Knowledge Representation and Reasoning

Date: 01/07/2026
Time: 15:30
Location: This defense will be held in a hybrid format (virtual and in person), in Room 215/43412 of the Instituto de Informática/UFRGS and via the link https://mconf.ufrgs.br/webconf/00195037.

Examining Committee:
-Lilian Berton (Unifesp)
-Betania Silva Carneiro Campello (UNICAMP)
-Joel Carbonera (UFRGS)

Committee Chair: Mariana Recamonde Mendoza Guerreiro

Abstract: Machine learning models are increasingly deployed in high-stakes domains such as healthcare, criminal justice, and finance, where biased predictions can cause systematic harm to specific population groups. A critical challenge in this context is the presence of data-level bias (distributional disparities rooted in the underrepresentation of historically disadvantaged groups), which can propagate into models and produce unfair outcomes before any prediction is made. While existing fairness toolkits offer a wide range of metrics to evaluate models and mitigation strategies, there is a lack of systematic empirical evidence linking data-level statistical properties to downstream model-level disparities, leaving practitioners without reliable early-warning indicators. This thesis addresses that gap by systematically investigating, through controlled manipulation of 11 diverse datasets into equally balanced and highly biased variations, how four data-level statistical metrics correlate with model fairness outcomes across four supervised learning algorithms. The results demonstrate that distributional metrics, particularly the divergence-based ones, outperform class imbalance as predictors of downstream unfairness, that no algorithm is inherently robust to data-level bias, and that balanced training consistently reduces model reliance on sensitive attributes, providing an evidence-based foundation for embedding fairness audits at the earliest stage of the machine learning pipeline.

Keywords: machine learning, bias, pre-training bias metrics, data-level bias metrics, health, model evaluation, model-level bias, post-training bias metrics