Viviane P. Moreira – Publications

Link to my publications on DBLP

Journal Articles

·         Thanh Nguyen, Viviane Moreira, Huong Nguyen, Hoa Nguyen, Juliana Freire, Multilingual Schema Matching for Wikipedia Infoboxes. Proceedings of the VLDB Endowment, vol. 5, no. 2, p. 133-144, 2011. Link to the article

Abstract:
Recent research has taken advantage of Wikipedia’s multilingualism as a resource for cross-language information retrieval and machine translation, as well as proposed techniques for enriching its cross-language structure. The availability of documents in multiple languages also opens up new opportunities for querying structured Wikipedia content, and in particular, to enable answers that straddle different languages. As a step towards supporting such queries, in this paper, we propose a method for identifying mappings between attributes from infoboxes that come from pages in different languages. Our approach finds mappings in a completely automated fashion. Because it does not require training data, it is scalable: not only can it be used to find mappings between many language pairs, but it is also effective for languages that are under-represented and lack sufficient training samples. Another important benefit of our approach is that it does not depend on syntactic similarity between attribute names, and thus, it can be applied to language pairs that have distinct morphologies. We have performed an extensive experimental evaluation using a corpus consisting of pages in Portuguese, Vietnamese, and English. The results show that not only does our approach obtain high precision and recall, but it also outperforms state-of-the-art techniques. We also present a case study which demonstrates that the multilingual mappings we derive lead to substantial improvements in answer quality and coverage for structured queries over Wikipedia content.

·         dos Santos, J.B., C.A. Heuser, V.P. Moreira, and L.K. Wives, Automatic threshold estimation for data matching applications. Information Sciences, v. 181, p. 2685-2699, 2011. Link to the article

Abstract:
Several advanced data management applications, such as data integration, data deduplication, and similarity querying rely on the application of similarity functions. A similarity function requires the definition of a threshold value in order to decide whether two different data instances match, i.e., if they represent the same real world object. In this context, threshold definition is a central problem. This paper proposes a method for estimating the quality of a similarity function. Quality is measured in terms of recall and precision calculated at several different thresholds. Based on the results of the proposed estimation process and the requirements of a specific application, a user is able to choose a suitable threshold value. The estimation process is based on a clustering phase performed over a data collection (or a sample thereof) and requires no human intervention since the choice of similarity threshold is based on the silhouette coefficient, which is an internal quality measure for clusters. An extensive set of experiments on artificial and real datasets demonstrates the effectiveness of the proposed approach. The results of the experiments show that in most cases the estimation error was below 10% in terms of precision and recall.
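As a rough illustration of the quality measure named in this abstract, the sketch below computes precision and recall of a similarity function at several thresholds. It assumes a small hand-labelled sample (the hypothetical `pairs` list), whereas the paper estimates these values without human intervention through clustering; the clustering step is not reproduced here.

```python
def precision_recall_at(scored_pairs, threshold):
    """Precision and recall when pairs scoring >= threshold are declared matches.

    scored_pairs: list of (similarity_score, is_true_match) tuples.
    """
    predicted = [is_match for score, is_match in scored_pairs if score >= threshold]
    true_positives = sum(predicted)
    total_matches = sum(is_match for _, is_match in scored_pairs)
    # By convention, an empty prediction set or an empty gold set scores 1.0.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / total_matches if total_matches else 1.0
    return precision, recall


# Hypothetical labelled sample: (score returned by the function, same object?)
pairs = [(0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.40, False)]

for threshold in (0.5, 0.75, 0.9):
    print(threshold, precision_recall_at(pairs, threshold))
```

Raising the threshold in this sample trades recall for precision, which is the trade-off a user weighs when choosing a threshold value.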

·         Dorneles, Carina F.; Nunes, Marcos Freitas; Heuser, Carlos A.; Moreira, Viviane P.; da Silva, Altigran S.; de Moura, Edleno S. A strategy for allowing meaningful and comparable scores in approximate matching. Information Systems, v. 34, p. 673-689, 2009. Link to the article

Abstract:
Approximate data matching aims at assessing whether two distinct instances of data represent the same real-world object. The comparison between data values is usually done by applying a similarity function which returns a similarity score. If this score surpasses a given threshold, both data instances are considered as representing the same real-world object. These score values depend on the algorithm that implements the function and have no meaning to the user. In addition, score values generated by different functions are not comparable. This will potentially lead to problems when the scores returned by different similarity functions need to be combined for computing the similarity between records. In this article, we propose that thresholds should be defined in terms of the precision that is expected from the matching process rather than in terms of the raw scores returned by the similarity function. Precision is a widely known similarity metric and has a clear interpretation from the user's point of view. Our approach defines mappings from score values to precision values, which we call adjusted scores. In order to obtain such mappings, our approach requires training over a small dataset. Experiments show that training can be reused for different datasets on the same domain. Our results also demonstrate that existing methods for combining scores for computing the similarity between records may be enhanced if adjusted scores are used.

Keywords: Similarity querying; Data integration; Data cleaning; Entity resolution; Deduplication
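One simple way to picture a mapping from raw scores to precision values is sketched below. It is not the authors' procedure: it assumes a small labelled training set with distinct scores, and it takes the adjusted score of a raw value to be the precision observed among training pairs scoring at least as high as the nearest training score at or below it. The names `fit_adjusted_scores` and `adjusted_score` are hypothetical.

```python
from bisect import bisect_right


def fit_adjusted_scores(training_pairs):
    """Return (score, precision) points in ascending score order.

    Each precision is computed over the training pairs whose score is
    greater than or equal to that point's score.
    """
    ordered = sorted(training_pairs, key=lambda pair: pair[0], reverse=True)
    points, true_positives = [], 0
    for rank, (score, is_match) in enumerate(ordered, start=1):
        true_positives += is_match
        points.append((score, true_positives / rank))
    points.reverse()
    return points


def adjusted_score(points, raw_score):
    """Look up the precision attached to the closest training score <= raw_score."""
    scores = [score for score, _ in points]
    index = max(bisect_right(scores, raw_score) - 1, 0)
    return points[index][1]


# Hypothetical training sample: (raw score, same real-world object?)
training = [(0.9, True), (0.8, True), (0.6, False), (0.5, True)]
points = fit_adjusted_scores(training)
print(adjusted_score(points, 0.85), adjusted_score(points, 0.55))
```

Because the adjusted values are precisions, scores produced by different similarity functions land on the same scale and can be compared or combined.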

·         RITT, M ; COSTA, A ; MERGEN, S ; ORENGO, V. M. An integer linear programming approach for approximate string comparison. European Journal of Operational Research, 2009. Link to the article

Abstract:
We introduce a problem called maximum common characters in blocks (MCCB), which arises in applications of approximate string comparison, particularly in the unification of possibly erroneous textual data coming from different sources. We show that this problem is NP-complete, but can nevertheless be solved satisfactorily using integer linear programming for instances of practical interest. Two integer linear formulations are proposed and compared in terms of their linear relaxations. We also compare the results of the approximate matching with other known measures such as the Levenshtein (edit) distance.

Keywords: Approximate string matching; Distance metric; Block edits; Block moves; Integer programming; Hop constraints
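For reference, the Levenshtein (edit) distance used as a comparison measure in this abstract can be computed with the standard dynamic-programming recurrence. The sketch below is the textbook algorithm, not the MCCB formulation proposed in the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

The edit distance counts only character-level operations, which is one reason block-oriented measures are of interest when whole words or fields are moved around.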

·         Silva, R.; Stasiu, R.K.; Orengo, V.M. and Heuser, C.A. Measuring quality of similarity functions in approximate data matching. Journal of Informetrics, 2007. 1(1): p. 35-46. Link to the article

Abstract:
This paper presents a method for assessing the quality of similarity functions. The scenario taken into account is that of approximate data matching, in which it is necessary to determine whether two data instances represent the same real world object. Our method is based on the semi-automatic estimation of optimal threshold values. We propose two methods for performing such estimation. The first method is an algorithm based on a reward function, and the second is a statistical method. Experiments were carried out to validate the techniques proposed. The results show that both methods for threshold estimation produce similar results. The output of such methods was used to design a grading function for similarity functions. This grading function, called discernability, was used to compare a number of similarity functions applied to an experimental data set.

Keywords: Approximate data matching; Similarity functions; Retrieval evaluation

·         Orengo, V.M. and C.R. Huyck, Relevance Feedback and Cross-language Information Retrieval. Information Processing & Management, 2006. 42(5): p. 1203-1217. Link to the article

Abstract:
This paper presents a study of relevance feedback in a cross-language information retrieval environment. We have performed an experiment in which Portuguese speakers are asked to judge the relevance of English documents; documents hand-translated to Portuguese and documents automatically translated to Portuguese. The goals of the experiment were to answer two questions (i) how well can native Portuguese searchers recognise relevant documents written in English, compared to documents that are hand translated and automatically translated to Portuguese; and (ii) what is the impact of misjudged documents on the performance improvement that can be achieved by relevance feedback. Surprisingly, the results show that machine translation is as effective as hand translation in aiding users to assess relevance in the experiment. In addition, the impact of misjudged documents on the performance of RF is overall just moderate, and varies greatly for different query topics.

Keywords: Cross-language information retrieval; Relevance feedback

·         Huyck, C.R. and V.M. Orengo, Information Retrieval and Categorisation Using a Cell Assembly Network. Neural Computing and Applications, 2005. 14(4): p. 282-289. Link to the article

Abstract:
Simulated networks of spiking leaky integrators are used to categorise and for Information Retrieval (IR). Neurons in the network are sparsely connected, learn using Hebbian learning rules, and are simulated in discrete time steps. Our earlier work has used these models to simulate human concept formation and usage, but we were interested in the model's applicability to real world problems, so we have done experiments on categorisation and IR. The results of the system show that congresspeople are correctly categorised 89% of the time. The IR systems have 40% average precision on the Time collection, and 28% on the Cranfield 1,400. All scores are comparable to the state of the art results on these tasks.

Keywords: Information retrieval - Categorisation - Neural network - Cell assembly - Hebbian learning

Book Chapters

·         Orengo, V.M. and D. Santos. Radicalizadores versus analisadores morfológicos: Sobre a participação do Removedor de Sufixos da Língua Portuguesa nas Morfolimpíadas, in Avaliação conjunta: um novo paradigma no processamento computacional da língua portuguesa, D. Santos, Editor. 2007, IST Press: Lisboa. ISBN: 978-972-8469-60-8

 

Conference Papers

·         VOLPE, I. ; Moreira, Viviane P. ; HUYCK, Christian. Cell Assemblies for Query Expansion in Information Retrieval. In: International Joint Conference on Neural Networks (IJCNN), 2011, San Jose. International Joint Conference on Neural Networks, 2011. Link to the article

Abstract:
One of the main tasks in Information Retrieval is to match a user query to the documents that are relevant for it. This matching is challenging because in many cases the keywords the user chooses will be different from the words the authors of the relevant documents have used. Throughout the years, many approaches have been proposed to deal with this problem. One of the most popular consists in expanding the query with related terms with the goal of retrieving more relevant documents. In this paper, we propose a new method in which a Cell Assembly model is applied for query expansion. Cell Assemblies are reverberating circuits of neurons that can persist long after the initial stimulus has ceased. They learn through Hebbian Learning rules and have been used to simulate the formation and the usage of human concepts. We adapted the Cell Assembly model to learn relationships between the terms in a document collection. These relationships are then used to augment the original queries. Our experiments use standard Information Retrieval test collections and show that some queries significantly improved their results with our technique.

·         ACOSTA, O. C. ; VILLAVICENCIO, A. ; Moreira, Viviane P. Identification and Treatment of Multiword Expressions applied to Information Retrieval. In: Multiword Expressions: from Parsing and Generation to the Real World (MWE 2011), 2011, Portland. Multiword Expressions: from Parsing and Generation to the Real World (MWE 2011), 2011. Link to the article

Abstract:
The extensive use of Multiword Expressions (MWE) in natural language texts prompts more detailed studies that aim for a more adequate treatment of these expressions. A MWE typically expresses concepts and ideas that usually cannot be expressed by a single word. Intuitively, with the appropriate treatment of MWEs, the results of an Information Retrieval (IR) system could be improved. The aim of this paper is to apply techniques for the automatic extraction of MWEs from corpora to index them as a single unit. Experimental results show improvements on the retrieval of relevant documents when identifying MWEs and treating them as a single indexing unit.

·         PEREIRA, R. C. ; MOREIRA, V.P. ; GALANTE, R. A New Approach for Cross-Language Plagiarism Analysis. In: CLEF 2010 Conference on Multilingual and Multimodal Information Access Evaluation, 2010, Padua. Proceedings of the CLEF 2010 Conference on Multilingual and Multimodal Information Access Evaluation, 2010. Link to the article

Abstract:
This paper presents a new method for Cross-Language Plagiarism Analysis. Our task is to detect the plagiarized passages in the suspicious documents and their corresponding fragments in the source documents. We propose a plagiarism detection method composed of five main phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. To evaluate our method, we created a corpus containing artificial plagiarism offenses. Two different experiments were conducted; the first one considers only monolingual plagiarism cases, while the second one considers only cross-language plagiarism cases. The results showed that the cross-language experiment achieved 86% of the performance of the monolingual baseline. We also analyzed how the plagiarized text length affects the overall performance of the method. This analysis showed that our method achieved better results with medium and large plagiarized passages.

·         PEREIRA, R. C. ; MOREIRA, V.P. ; GALANTE, R. UFRGS@PAN2010: Detecting External Plagiarism. In: PAN 2010 Lab on Uncovering Plagiarism, Authorship, and Social Software Misuse, 2010, Padua. Proceedings of the PAN 2010 Lab on Uncovering Plagiarism, Authorship, and Social Software Misuse, 2010. Link to the article

Abstract:
This paper presents our approach to detect plagiarism in the PAN’10 competition. To accomplish this task we applied a method which aims at detecting external plagiarism cases. The method is specially designed to detect cross-language plagiarism and is composed of five phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. Our group placed seventh in the competition with an overall score of 0.5175. It is important to note that the final score was affected by our low recall (0.4036), which arose as a result of not detecting intrinsic plagiarism cases, which were also present in the competition corpus.

·         FLORES, F.; MOREIRA, Viviane P.; Heuser, C. A. Assessing the Impact of Stemming Accuracy on Information Retrieval. In: International Conference on Computational Processing of Portuguese Language (PROPOR 2010), 2010. p. 11-20. Link to the article

Abstract:
The quality of stemming algorithms is typically measured in two different ways: (i) how accurately they map the variant forms of a word to the same stem; or (ii) how much improvement they bring to Information Retrieval. In this paper, we evaluate different Portuguese stemming algorithms in terms of accuracy and in terms of their aid to Information Retrieval. The aim is to assess whether the most accurate stemmers are also the ones that bring the biggest gain in Information Retrieval. Our results show that some kind of correlation does exist, but it is not as strong as one might have expected.

·         GERALDO, A. P. ; MOREIRA, Viviane P. ; GONCALVES, M. A. On-demand Associative Cross-Language Information Retrieval. In: SPIRE, 2009, Saariselka. Proceedings of SPIRE 2009 (LNCS 5721), 2009. p. 165-173. Link to the article

Abstract:
This paper proposes the use of algorithms for mining association rules as an approach for Cross-Language Information Retrieval. These algorithms have been widely used to analyse market basket data. The idea is to map the problem of finding associations between sales items to the problem of finding term translations over a parallel corpus. The proposal was validated by means of experiments using queries in two distinct languages: Portuguese and Finnish to retrieve documents in English. The results show that the performance of our proposed approach is comparable to the performance of the monolingual baseline and to query translation via machine translation, even though these systems employ more complex Natural Language Processing techniques. The combination between machine translation and our approach yielded the best results, even outperforming the monolingual baseline.

Keywords:  association rules, experimentation, performance measurement
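The idea of treating term translation as association-rule mining can be sketched as follows. Each aligned document pair plays the role of a market basket, and a rule source term -> target term is kept when its support and confidence pass minimum values. The function name, the thresholds, and the toy Portuguese-English corpus are assumptions for illustration, and the paper's on-demand algorithm is not reproduced.

```python
from collections import Counter
from itertools import product


def translation_rules(parallel_docs, min_support=2, min_confidence=0.6):
    """Mine rules source_term -> target_term from aligned (source, target) documents.

    support    = number of aligned pairs containing both terms
    confidence = support / number of source documents containing source_term
    """
    source_counts = Counter()
    pair_counts = Counter()
    for source_terms, target_terms in parallel_docs:
        source_set, target_set = set(source_terms), set(target_terms)
        source_counts.update(source_set)
        pair_counts.update(product(source_set, target_set))

    rules = {}
    for (source, target), support in pair_counts.items():
        confidence = support / source_counts[source]
        if support >= min_support and confidence >= min_confidence:
            rules.setdefault(source, []).append((target, confidence))
    # Best candidate translation first, ties broken alphabetically.
    for source in rules:
        rules[source].sort(key=lambda rule: (-rule[1], rule[0]))
    return rules


# Hypothetical aligned corpus (Portuguese terms, English terms).
docs = [
    (["casa", "azul"], ["blue", "house"]),
    (["casa", "grande"], ["big", "house"]),
    (["carro", "azul"], ["blue", "car"]),
]
rules = translation_rules(docs)
print(rules)
```

A query term can then be replaced by the highest-confidence target terms found for it, without any dictionary or machine translation system.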

·         GERALDO, A. P. ; ORENGO, Viviane Moreira. UFRGS@CLEF2008: Using Association Rules for Cross-Language Information Retrieval. In: Cross-Language Evaluation Forum (CLEF), 2008, Aarhus. Working Notes of CLEF2008, 2008. Link to the article

Abstract:
For UFRGS’s participation on the TEL task at CLEF2008, our aim was to assess the validity of using algorithms for mining association rules to find mappings between concepts on a Cross-Language Information Retrieval scenario. Our approach requires a sample of parallel documents to serve as the basis for the generation of the association rules. The results of the experiments show that the performance of our approach is not statistically different from the monolingual baseline in terms of mean average precision. This is an indication that association rules can be effectively used to map concepts between languages. We have also tested a modification to BM25 that aims at increasing the weight of rare terms. The results show that this modified version achieved better performance. The improvements were considered to be statistically significant in terms of MAP on our monolingual runs.

Keywords:  association rules, experimentation, performance measurement

·         ACOSTA, O. C. ; ORENGO, Viviane Moreira ; VILLAVICENCIO, A. UFRGS@CLEF2008: Indexing Multiword Expressions for Information Retrieval. In: Cross-Language Evaluation Forum, 2008, Aarhus. Working Notes of CLEF 2008, 2008. Link to the article

Abstract:
For UFRGS’s participation on CLEF’s Robust task, our aim was to assess the benefits of identifying and indexing Multiword Expressions (MWEs) for Information Retrieval. The approach used for MWE identification was totally statistical, based on association measures such as Mutual Information and Chi-square. Contradicting our results on the training topics, the results on the test topics did not show any significant improvements. However, for some queries, the identification of MWEs was very important. We have also performed bilingual experiments which achieved 84% of their monolingual counterparts.

Keywords:  experimentation, performance measurement, multiword expression
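A statistical association measure of the kind mentioned in this abstract can be sketched with pointwise mutual information (PMI) over adjacent word pairs. This is one common form of mutual information for collocation detection, assumed here for illustration; the exact measures, corpus, and cut-offs used in the paper are not reproduced, and `bigram_pmi` and `min_count` are hypothetical names.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """PMI (base 2) for adjacent word pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_unigrams = len(tokens)
    total_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for rare pairs, so skip them
        joint = count / total_bigrams
        independent = (unigrams[first] / total_unigrams) * (unigrams[second] / total_unigrams)
        scores[(first, second)] = math.log2(joint / independent)
    return scores


tokens = "new york is big and new york is old".split()
scores = bigram_pmi(tokens)
print(scores)
```

Pairs with a high score co-occur more often than chance would predict and are candidates for indexing as a single unit.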

·         SANTOS, J. B. ; HEUSER, Carlos Alberto ; ORENGO, Viviane Moreira ; WIVES, L. K. Automatic Threshold Estimation for Data Matching Applications. In: Simpósio Brasileiro de Banco de Dados, 2008, Campinas. Anais do SBBD 2008, 2008. Link to the article

Abstract:
Several advanced data management applications, such as data integration, data deduplication or similarity querying rely on the application of similarity functions. A similarity function requires the definition of a threshold value in order to assess if two different data instances match, i.e., if they represent the same real world object. In this context, the threshold definition is a central problem. In this paper, we propose a method for the estimation of the quality of a similarity function. Quality is measured in terms of recall and precision calculated at several different thresholds. On the basis of the results of the proposed estimation process, and taking into account the requirements of a specific application, a user is able to choose a threshold value that is adequate for the application. The proposed estimation process is based on a clustering phase performed on a sample taken from a data collection and requires no human intervention.

Keywords:  similarity querying, clustering

·         Dorneles, C. F. ; Heuser, Carlos Alberto ; Orengo, Viviane Moreira ; Silva, A. S. ; Moura, E. S. A Strategy for Allowing Meaningful and Comparable Scores in Approximate Matching. In: CIKM - Conference on Information and Knowledge Management, 2007, Lisboa, Portugal. p. 303-312. Link to the article

Abstract:
The goal of approximate data matching is to assess whether two distinct data instances represent the same real world object. This is usually achieved through the use of a similarity function, which returns a score that defines how similar two data instances are. If this score surpasses a given threshold, both data instances are considered as representing the same real world object. The score values returned by a similarity function depend on the algorithm that implements the function and have no meaning to the user (apart from the fact that a higher similarity value means that two data instances are more similar). In this paper, we propose that instead of defining the threshold in terms of the scores returned by a similarity function, the user specifies the precision that is expected from the matching process. Precision is a well known quality measure and has a clear interpretation from the user's point of view. Our approach relies on mapping between similarity scores and precision values based on a training data set. Experimental results show the training may be executed against a representative data set, and reused for other databases from the same domain.

Keywords: data cleaning - data integration - similarity querying

·         Heuser, Carlos Alberto ; Krieser, F. A. ; Orengo, Viviane Moreira. SimEval - A Tool for Evaluating the Quality of Similarity Functions. In: Tutorials, posters, panels and industrial contributions at the 26th International Conference on Conceptual Modeling - ER 2007. Auckland: ACS, 2007. v. 83. p. 71-76. Link to the article

Abstract:
Approximate data matching applications typically use similarity functions to quantify the degree of likeness between two data instances. There are several similarity functions available; thus, it is often necessary to evaluate a number of them in order to choose the function that is most adequate for a specific application. This paper presents a tool that uses average precision and discernability to evaluate the quality of similarity functions over a data set.

Keywords: approximate data matching - similarity functions 

·         Orengo, V.M., L. Buriol, and A. Coelho. A study on the use of Stemming for Monolingual Ad-Hoc Portuguese Information Retrieval, in Evaluation of Multilingual and Multi-modal Information Retrieval, C. Peters, et al., Editors. 2007, Springer Berlin / Heidelberg. p. 91-98. CLEF 2006, Alicante. Link to the article

Abstract:
For UFRGS’s first participation on CLEF our goal was to compare the performance of heavier and lighter stemming strategies using the Portuguese data collections for Monolingual Ad-hoc retrieval. The results show that the safest strategy was to use the lighter alternative (reducing plural forms only). On a query-by-query analysis, full stemming achieved the highest improvement but also the biggest decrease in performance when compared to no stemming. In addition, statistical tests showed that the only significant improvement both in terms of mean average precision and precision at ten was achieved by our lighter stemmer.

Keywords: Information retrieval - stemming algorithms - evaluation

·         Orengo, V.M. and C.R. Huyck, Portuguese-English Cross-Language Information Retrieval Using Latent Semantic Indexing, in Advances in Cross-Language Information Retrieval - Third Workshop of the Cross-Language Evaluation Forum, CLEF 2002 (LNCS 2785), C. Peters, et al., Editors. 2003, Springer: Rome. Link to the article

Abstract:
This paper reports the work of Middlesex University in the CLEF bilingual task. We have carried out experiments using Portuguese queries to retrieve documents in English. The approach used was Latent Semantic Indexing, which is an automatic method not requiring dictionaries or thesauri. We have also run a monolingual version of the system to work as a baseline. Here we describe in detail the methods used and give an analysis of the results obtained.

Keywords: cross-language information retrieval, stemming algorithms, latent semantic indexing

·         Orengo, V.M. and C.R. Huyck, A Stemming Algorithm for the Portuguese Language, in 8th International Symposium on String Processing and Information Retrieval (SPIRE). 2001: Laguna de San Rafael, Chile. p. 183-193. Link to the article

Abstract:
Stemming algorithms are traditionally used in Information Retrieval with the goal of enhancing recall, as they conflate the variant forms of a word into a common representation. This paper describes the development of a simple and effective suffix-stripping algorithm for Portuguese. The stemmer is evaluated using a method proposed by Paice [9]. The results show that it performs significantly better than the Portuguese version of the Porter algorithm.

Keywords: stemming algorithms, evaluation

·         Orengo, V.M., Cross-Language Information Retrieval and Digital Libraries, in C@MDX, C. Nielsen and V.M. Orengo, Editors. 2000, Middlesex University: London, UK

 

·         Moreira, V.P. and N. Edelweiss. Schema Versioning: Queries to the Generalised Temporal Database System. In: INTERNATIONAL WORKSHOP ON SPATIO-TEMPORAL DATA MODELS AND LANGUAGES. 1999. Florence, Italy: IEEE.

 

·         Moreira, V.P. and N. Edelweiss, Queries to Temporal Databases Supporting Schema Versioning, in SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD'99). 1999: Florianópolis. p. 299-313

 

·         Moreira, V.P. and N. Edelweiss, Versioning and The Generalised Temporal Database, in XXV CONFERENCIA LATINO-AMERICANA DE INFORMATICA - CLEI. 1999: Asunción, Paraguay. p. 111-122

 
 

Other Publications

·         Orengo, V.M., Assessing Relevance Using Automatically Translated Documents for Cross-Language Information Retrieval, in School of Computing Science. 2004, Middlesex University: London, UK. p. 258. PhD Thesis

·         Moreira, V.P., Evolução de Esquemas em Bancos de Dados Temporais. 1997, UFRGS: Porto Alegre (Brazil). Individual Work (in Portuguese)

·         Moreira, V.P., Consultas a Bancos de Dados Temporais que suportam Versionamento de Esquemas, in Instituto de Informática. 1999, UFRGS: Porto Alegre (Brazil). Master's Dissertation (in Portuguese)

·         Moreira, V.P. and N. Edelweiss, Consultas a bancos de dados temporais que suportam versionamento de esquemas, in SEMANA ACADEMICA DO CPGCC. 1998: Porto Alegre, RS, Brazil (in Portuguese)