
Master's Thesis Defense of Lucas Lima de Oliveira

Event Details


Student: Lucas Lima de Oliveira
Advisor: Profª. Drª. Viviane Pereira Moreira

Título: Information Retrieval in the Geoscientific Domain: Building and Evaluating Resources
Research Line: Data Mining, Integration and Analysis

Date: 27/01/2022
Time: 9:30 a.m.
The defense will exceptionally be held entirely online. Those interested in attending may access the virtual room at: https://mconf.ufrgs.br/webconf/00149248

Examination Committee:
– Profª. Drª. Carina Friedrich Dorneles (UFSC)
– Profª. Drª. Mara Abel (UFRGS)
– Prof. Dr. Leandro Krug Wives (UFRGS)
Committee Chair: Profª. Drª. Viviane Pereira Moreira

Abstract: The Portable Document Format (PDF) has become the de facto standard for document storage and sharing. Scientific papers, project proposals, contracts, books, and legal documents are typically stored and distributed as PDF files. While the textual contents of born-digital PDF documents can be extracted with high accuracy, documents consisting of scanned images typically require Optical Character Recognition (OCR). The output of OCR can be noisy, especially when the quality of the scanned image is poor (as is common in historical documents), which in turn can affect downstream tasks such as Information Retrieval (IR). Post-processing OCR-ed documents is an alternative for fixing extraction errors and, intuitively, improving the results of downstream tasks. This work evaluates the impact of OCR extraction and correction on IR. We compared different extraction and correction methods on OCR-ed data from real scanned documents. To evaluate IR tasks, the standard paradigm requires a test collection with documents, queries, and relevance judgments. Creating test collections demands significant human effort, mainly for providing relevance judgments. As a result, many domains and languages still lack a proper evaluation testbed. Portuguese is an example of a major world language that has been overlooked in IR research: the only test collection available consists of news articles from 1994 and a hundred queries. To bridge this gap, we developed REGIS (Retrieval Evaluation for Geoscientific Information Systems), a test collection for the geoscientific domain in Portuguese. REGIS contains 20K documents and 34 query topics along with relevance assessments. Our experiments with REGIS showed that, on average over the complete set of query topics, retrieval quality metrics change very little. However, a more detailed analysis revealed that most query topics improved with error correction.
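As a brief illustration of the evaluation paradigm the abstract mentions (a test collection with documents, queries, and relevance judgments), the sketch below computes Average Precision, one standard retrieval quality metric, for a single query. This is a minimal, hypothetical example, not code from the thesis; the document IDs and judgments are invented.

```python
def average_precision(ranking, relevant):
    """Compute Average Precision for one query.

    ranking:  list of document IDs in the order the system retrieved them.
    relevant: set of document IDs judged relevant for the query (the
              relevance judgments, or "qrels", of a test collection).
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(relevant)

# Example: documents d2 and d4 are judged relevant for the query.
print(average_precision(["d1", "d2", "d3", "d4"], {"d2", "d4"}))  # → 0.5
```

Averaging this value over all query topics yields Mean Average Precision (MAP); comparing per-topic values, rather than only the mean, is what reveals the topic-level differences the abstract describes.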

Keywords: Information retrieval. Test collection. OCR errors. Error correction.