Student: Higor Moreira
Advisor: Viviane Pereira Moreira
Title: Named Entity Recognition in the Portuguese Oil & Gas Domain: Creating Resources and Evaluating Data Augmentation Methods
Research Area: Natural Language Processing
Date: 6 March 2026
Time: 09:00
Location: This defense will be held remotely. Public access is available at https://mconf.ufrgs.br/webconf/00149248.
Examination Committee:
-Daniela Barreiro Claro (UFBA)
-Ellen Polliana Ramos Souza Pereira (UFRPE)
-Dennis Giovani Balreira (UFRGS)
Committee Chair: Viviane Pereira Moreira
Abstract: The Oil and Gas (O&G) industry handles vast amounts of unstructured textual data, for which Named Entity Recognition (NER) is essential to structured knowledge discovery and technical decision-making. Despite the strategic importance of this sector, the lack of standardized linguistic resources in Portuguese hinders the development of robust domain-specific applications. This work introduces PetroGeoNER, a unified and revised dataset created by consolidating academic and technical sources, specifically GeoCorpus-3 and PetroNER. The methodology employed a five-stage refinement pipeline, including entity mapping and manual verification, to establish a high-quality benchmark. Several NER models were trained on the dataset, and XLM-RoBERTa Large achieved the best performance, with a micro-averaged F1 score of 0.90. The model was then applied to the REGIS collection, demonstrating its practical utility by identifying millions of technical entities across geoscientific documents. Finally, we evaluated state-of-the-art Transformer models and investigated Data Augmentation (DA) techniques, ranging from simple heuristics to Large Language Models (LLMs), to address the data scarcity inherent in specialized scenarios. Both generative and heuristic DA methods provided measurable improvements in low-resource settings, with PP-LLM and traditional heuristics emerging as the top-performing strategies. This research contributes a reliable resource and an evaluation framework to support NLP applications in the geoscientific field.
Keywords: Named Entity Recognition, Data Augmentation, Large Language Models, Dataset Creation