Thesis Defense: Marcio Nicolau


Event Details


Student: Marcio Nicolau
Advisor: Luis da Cunha Lamb

Title: Neurosymbolic ViT-LoMoE: Toward Interpretable and Knowledge-Consistent Visual Recognition
Research Area: Machine Learning, Knowledge Representation and Reasoning

Date: 24/03/2026
Time: 08:00
Location: This defense will be held in hybrid format (online and in person), in the VIRTUAL room of the Instituto de Informática/UFRGS and via the link https://meet.google.com/ftp-pffz-yec.

Examination Committee:
-Álvaro Freitas Moreira (UFRGS)
-Ricardo Matsumura Araujo (UFPEL)
-Gabriel de Oliveira Ramos (Unisinos)

Committee Chair: Luis da Cunha Lamb

Abstract: Vision Transformer models reach near-saturated performance on common benchmarks, but they still function largely as statistical black boxes, without built-in mechanisms to guarantee or check that their behavior is logically consistent with expert-defined domain constraints. Neurosymbolic frameworks aim to close this gap, but remain constrained by two structural limitations: (i) manual construction of formal knowledge bases is labor-intensive and does not scale to complex domains, and (ii) no existing architecture incorporates symbolic constraints into Mixture-of-Experts (MoE) routing. This thesis advances neurosymbolic computer vision along three main dimensions, supported by four peer-reviewed studies, and evaluates all proposed methods on plant pathology benchmarks as a representative knowledge-rich domain. First, it presents a systematic survey and taxonomy of more than 60 neurosymbolic AI methods for computer vision (2017–2025), establishing a unifying conceptual framework and identifying four open research directions that were subsequently addressed. Second, it introduces NSVIT, which reformulates fine-grained visual classification as a logical satisfiability problem, replacing cross-entropy loss with a satisfaction-aggregate objective over a Logic Tensor Network (LTN) knowledge base. Across three benchmarks (PlantVillage, Wheat Rust, PlantDoc; 5-fold cross-validation), NSVIT achieves classification performance comparable to training with a cross-entropy objective (Wilcoxon signed-rank test, $p > 0.05$), while additionally producing interpretable, per-instance SAT scores (PlantVillage: $0.861$; Wheat Rust: $0.874$; PlantDoc: $0.112$). These scores constitute actionable uncertainty estimates that also characterize the degree of alignment between the underlying knowledge base and the empirical data. Further, the work introduces a five-stage LLM-assisted pipeline for automatic construction of formal knowledge bases from the scientific literature.
When applied to 35 publications at a total API cost of \$0.00248, this pipeline achieves 100% extraction reproducibility and improves knowledge quality scores from 50/100 to 100/100 through ontology-based prompting aligned with AGROVOC and PATO. These elements are integrated within ViT-LoMoE, a logically informed mixture-of-experts (MoE) architecture whose hybrid routing combines a learned gating mechanism with ontological priors, thereby directing visual inputs to semantically specialized expert subnetworks. Logic Tensor Network (LTN) constraints are incorporated to enforce consistency through a differentiable SAT-based loss. A three-stage training protocol (masked autoencoder pre-training, supervised LoMoE optimization, and GRPO-based reinforcement-learning fine-tuning of the router) is trained from scratch and evaluated on four plant pathology benchmarks covering both controlled and in-field imaging conditions. Across all backbone configurations, ViT-LoMoE consistently exceeds a ViT-MoE baseline. Performance gains reach $+2.44$ percentage points (pp) on Wheat Rust ($97.89 \pm 0.76\%$ vs. $95.45 \pm 1.49\%$), $+21.98$ pp on Cassava ($68.77 \pm 0.96\%$ vs. $46.79 \pm 0.46\%$), and $+1.73$ to $+2.17$ pp on PlantVillage ($99.30$–$99.39\%$). Phase-3 GRPO fine-tuning consistently improves accuracy ($+0.49$ to $+1.17$ pp on Wheat Rust; $+1.06$ pp on Cassava; $+0.85$ pp on PlantDoc) without statistically significant degradation in logical constraint satisfaction ($|\Delta\mathrm{SAT}| \leq 0.05$ pp). Analysis of SAT metrics reveals four qualitatively distinct regimes of knowledge-base–data alignment and enables per-class diagnostics of annotation noise based on discrepancies between accuracy and SAT scores.
Collectively, these findings demonstrate that logical constraints can effectively guide visual feature learning and expert specialization at negligible knowledge-engineering cost, producing a reproducible end-to-end neurosymbolic framework for interpretable visual recognition in knowledge-intensive domains.
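
Illustrative sketch (not taken from the thesis): the abstract mentions two mechanisms, hybrid routing that mixes a learned gate with ontological priors, and a SAT-based loss that aggregates the truth values of knowledge-base axioms. The plain-Python example below shows one way such quantities could be computed. The convex mixing weight `alpha`, the function names, and the example numbers are all assumptions made for illustration; the aggregator is the p-mean-error form commonly used in the Logic Tensor Network literature, and the thesis may use a different formulation. A real implementation would operate on differentiable tensors rather than Python lists.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_routing(gate_logits, ontology_prior, alpha=0.7):
    """Mix learned gate probabilities with a fixed prior over experts.

    ASSUMPTION: a convex combination with weight alpha; the thesis may
    combine the two signals differently.
    """
    learned = softmax(gate_logits)
    return [alpha * g + (1.0 - alpha) * p
            for g, p in zip(learned, ontology_prior)]


def sat_aggregate(truth_values, p=2):
    """Aggregate per-axiom truth values in [0, 1] into one SAT score.

    Uses the p-mean-error aggregator: 1 - (mean((1 - t)^p))^(1/p).
    """
    n = len(truth_values)
    mean_err = sum((1.0 - t) ** p for t in truth_values) / n
    return 1.0 - mean_err ** (1.0 / p)


def sat_loss(truth_values, p=2):
    """Loss that reaches zero when every axiom is fully satisfied."""
    return 1.0 - sat_aggregate(truth_values, p)


# Hypothetical example: three experts, with a prior favoring expert 0.
weights = hybrid_routing([2.0, 0.5, 0.1], [0.6, 0.3, 0.1])
# Hypothetical truth values of three knowledge-base axioms for one image.
score = sat_aggregate([0.9, 0.8, 1.0])
```

Because both the softmax output and the prior sum to one, the mixed routing weights also sum to one, so they can be used directly as expert weights. Larger values of `p` make the aggregator focus more on the least satisfied axioms.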

Keywords: Neurosymbolic AI; Computer Vision; Mixture of Experts; Logic Tensor Networks; Plant Disease Detection; Vision Transformers; Interpretability