Published on: 25/03/2010

Master's Dissertation in Conceptual Modeling and Databases

UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL
INSTITUTO DE INFORMÁTICA
PROGRAMA DE PÓS-GRADUAÇÃO EM COMPUTAÇÃO
———————————————————
MASTER'S DISSERTATION DEFENSE

Student: Rafael Corezola Pereira
Advisor: Prof. Dr. Viviane Pereira Moreira
Co-advisor: Renata de Matos Galante

Title: Cross-Language Plagiarism Detection
Research Area: Conceptual Modeling and Databases

Date: 30/08/2010
Time: 16:30
Venue: Auditório Verde

Examining Committee:
Prof. Dr. Solange Oliveira Rezende (USP)
Prof. Dr. Leandro Krug Wives (UFRGS)
Prof. Dr. Carlos Alberto Heuser (UFRGS)

Committee Chair: Prof. Dr. Viviane Pereira Moreira

ABSTRACT:
Plagiarism is one of the most serious forms of academic misconduct. It is defined as “the use of another person’s written work without acknowledging the source”. As a countermeasure to this problem, there are several methods that attempt to automatically detect plagiarism between documents. In this context, this work proposes a new method for Cross-Language Plagiarism Analysis. The method aims at detecting external plagiarism cases, i.e., it tries to detect the plagiarized passages in the suspicious documents (the documents to be investigated) and their corresponding text fragments in the source documents (the original documents). To accomplish this task, we propose a plagiarism detection method composed of five main phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. Since the method is designed to detect cross-language plagiarism, we used a language guesser to identify the language of the documents and an automatic translation tool to translate all the documents in the collection into a common language (so they can be analyzed in a uniform way). After language normalization, we applied a classification algorithm to build a model that is able to differentiate a plagiarized text passage from a non-plagiarized one. Once the classifier is trained, the suspicious documents can be analyzed. An information retrieval system is used to retrieve, based on passages extracted from each suspicious document, the passages from the original documents that are most likely to be the source of plagiarism. The plagiarism analysis is performed only after the candidate passages are retrieved. Finally, a post-processing technique is applied to the reported results in order to join contiguous plagiarized passages. We evaluated our method using three freely available test collections. Two of them were created for the PAN competitions (PAN’09 and PAN’10), which are international competitions on plagiarism detection. Since only a small percentage of these two collections contained cross-language plagiarism cases, we also created an artificial test collection especially designed to contain this kind of offense. We named the test collection ECLaPA (Europarl Cross-Language Plagiarism Analysis). The results achieved while analyzing these collections showed that the proposed method is a viable approach to the task of cross-language plagiarism analysis.
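
The post-processing phase mentioned in the abstract (joining contiguous plagiarized passages) can be illustrated with a minimal sketch. This is not the dissertation's implementation: it assumes each detection is reported as a pair of character offsets within a single suspicious document, and the function name `merge_contiguous` and the `max_gap` tolerance are hypothetical names introduced only for illustration.

```python
from typing import List, Tuple

# Assumption: a detection is (start_offset, end_offset) in one suspicious document.
Span = Tuple[int, int]


def merge_contiguous(spans: List[Span], max_gap: int = 0) -> List[Span]:
    """Join detected passages that overlap or lie within max_gap characters.

    max_gap is a hypothetical tolerance, not a parameter taken from the source.
    """
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous passage: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # Too far from the previous passage: start a new one.
            merged.append((start, end))
    return merged


detections = [(0, 120), (120, 250), (900, 1000), (260, 300)]
print(merge_contiguous(detections, max_gap=10))  # [(0, 300), (900, 1000)]
```

Sorting the detections first means a single pass is enough, since each passage only needs to be compared with the most recently merged one.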

KEYWORDS: Plagiarism, cross-language plagiarism detection, plagiarism test collections