
Doctoral Thesis of Bruno Iochins Grisci


Event Details


Student: Bruno Iochins Grisci
Advisor: Prof. Dr. Márcio Dorn

Title: Knowledge discovery in biological tabular data through machine learning interpretability and visualization
Research Area: Machine Learning, Knowledge Representation and Reasoning

Date: 26 June 2023
Time: 8:30 a.m.
Location: This defense will be held entirely remotely. Those interested in attending can access the virtual room at: https://meet.google.com/aqk-keoc-fys

Examining Committee:
– Prof. Dr. Juliana Silva Bernardes (Sorbonne Université)
– Prof. Dr. Rosane Minghim (University College Cork)
– Prof. Dr. Carla Maria Dal Sasso Freitas (UFRGS)

Committee Chair: Prof. Dr. Márcio Dorn

Abstract: One of the critical problems in the biological and health sciences is the identification of new biomarkers. These molecules can be used to identify cancers, study the mechanisms behind the diseases, and help develop treatments. However, identifying biomarkers from raw genetic data presents several challenges, and the feature selection literature focuses on outdated datasets. This work proposes using deep learning and new interpretability methods to analyze and extract relevant biological information from high-dimensional cancer datasets. The lack of interpretability of neural networks is part of the reason they are not more widely adopted. Many works focus on explaining their predictions, but only some take tabular data into consideration, which has limited the adoption of these methods. We present “relevance aggregation,” an algorithm that combines the relevance computed from several samples as learned by a neural network and generates scores for each input feature. The method was tested on synthetic and real-world datasets for classification and regression tasks. It correctly identified the most important features for the network’s predictions. The ranking of feature scores matches their contribution to the model’s performance, and the top-ranked features consistently improved the performance of an independent classifier. We also present a visualization method for inspecting feature scoring results called “weighted t-SNE.” It was evaluated with several feature scorers and datasets, producing trustworthy visualizations to help the user decide which algorithm to use. The silhouette coefficient of weighted t-SNE is correlated with the prediction performance of neural networks and can help in machine learning interpretability. Finally, we trained neural networks to classify samples from cancer datasets. They were inspected with relevance aggregation and weighted t-SNE, from which the most relevant genes were obtained. The feature relevance scores learned from the neural networks improved the class separability for all datasets. Functional enrichment analysis of the genes with higher relevance scores from breast and brain cancer datasets showed that most are linked to known cancer bioprocesses and cellular components. Some are even known biomarkers, according to a literature review. The results indicate that the proposed methods can automatically extract new and relevant biological information from gene expression data.

Keywords: Feature Selection. Interpretable Machine Learning. Artificial Neural Networks. Dimensionality Reduction. Data Visualization. Gene Expression Data. Machine Learning. Gene Selection. Microarray.