Viviane P. Moreira – Publications
· BECKER, KARIN ; Moreira, Viviane P. ; DOS SANTOS, ALINE G.L. . Multilingual emotion classification using supervised learning: Comparative experiments. Information Processing & Management, v. 53, p. 684-704, 2017. Link to the article
The importance of emotion mining is acknowledged in a wide range of new applications, thus broadening the potential market already proven for opinion mining. However, the lack of resources for languages other than English is even more critical for emotion mining. In this article, we investigate whether Multilingual Sentiment Analysis delivers reliable and effective results when applied to emotions. For this purpose, we developed experiments involving machine translations over corpora originally written in two languages. Our experimental framework for emotion classification assesses variations in (i) the language of the original text and its translations; (ii) strategies to combine multiple languages to overcome losses due to translation; (iii) options for data pre-processing (tokenization, feature representation, and feature selection); and (iv) classification algorithms, including meta-classifiers. The results show that emotion classification performance improves significantly with the use of texts in multiple languages, particularly by adopting a stacking of weak monolingual classifiers. Our study also sheds light on the impacts of data preparation strategies and their combination with classification algorithms, and compares differences between polarity and emotion classification under the same experimental settings.
· MERGEN, SERGIO L. S. ; Moreira, Viviane P. . DuelMerge: Merging with Fewer Moves. Computer Journal (Print), v. 67, p. 1-8, 2017. Link to the article
This work proposes duelmerge, a stable merging algorithm that is asymptotically optimal in the number of comparisons and performs O(n log2(n)) moves. Unlike other partition-based algorithms, we only allow blocks of equal sizes to be swapped, which reduces the number of moves required. We performed experiments comparing duelmerge against a number of baselines including recmerge, the standard merging solution for programming languages such as C, and some more recent approaches. The results show that our proposed algorithm performs fewer moves than other stable solutions. Experiments employing duelmerge within MergeSort confirmed our positive results in terms of moves, comparisons and runtime.
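The comparison/move trade-off at the heart of this result can be made concrete with a small sketch. The function below is a plain textbook two-way merge instrumented to count comparisons and element moves, the two cost metrics duelmerge optimises; it is not the duelmerge algorithm itself, which the abstract does not describe in enough detail to reproduce.

```python
def merge_count(a, b):
    """Merge two sorted lists, counting comparisons and element moves.

    Illustrates the cost metrics (comparisons, moves) that merging
    algorithms such as duelmerge aim to minimise; this is the plain
    textbook merge, not the duelmerge algorithm.
    """
    out, comparisons, moves = [], 0, 0
    i = j = 0
    while i < len(a) and j < len(b):
        comparisons += 1
        if a[i] <= b[j]:              # stable: ties taken from `a` first
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
        moves += 1
    rest = a[i:] + b[j:]              # leftovers move without comparisons
    moves += len(rest)
    out.extend(rest)
    return out, comparisons, moves
```

For `merge_count([1, 3, 5], [2, 4])` this yields 4 comparisons and 5 moves; algorithms like duelmerge aim to shrink the move count while keeping comparisons asymptotically optimal.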
· Pertile, S. d. L., Moreira, V. P. and Rosso, P. (2016), Comparing and combining Content- and Citation-based approaches for plagiarism detection. J Assn Inf Sci Tec. Link to the article
The vast amount of scientific publications available online makes it easier for students and researchers to reuse text from other authors and harder to check the originality of a given text. Reusing text without crediting the original authors is considered plagiarism. A number of studies have reported the prevalence of plagiarism in academia. As a consequence, numerous institutions and researchers are dedicated to devising systems to automate the process of checking for plagiarism. This work focuses on the problem of detecting text reuse in scientific papers. The contributions of this paper are twofold: (a) we survey the existing approaches for plagiarism detection based on content, based on content and structure, and based on citations and references; and (b) we compare content- and citation-based approaches with the goal of evaluating whether they are complementary and whether their combination can improve the quality of the detection. We carried out experiments with real data sets of scientific papers and concluded that a combination of the methods can be beneficial.
· Anderson Uilian Kauer, Viviane Pereira Moreira. Using information retrieval for sentiment polarity prediction. Expert Syst. Appl. 61: 282-289 (2016). Link to the article
Social networks such as Twitter are used by millions of people who express their opinions on a variety of topics. Consequently, these media are constantly being examined by sentiment analysis systems which aim at classifying the posts as positive or negative. Given the variety of topics discussed and the short length of the posts, the standard approach of using the words as features for machine learning algorithms results in sparse vectors. In this work, we propose using features derived from the ranking generated by an Information Retrieval System in response to a query consisting of the post that needs to be classified. Our system can be fully automatic, has only 24 features, and does not depend on expensive resources. Experiments on real datasets have shown that a classifier that relies solely on these features outperforms established baselines and can reach accuracies comparable to the state-of-the-art approaches which are more costly.
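The core idea — turning a retrieval ranking into classification features — can be sketched as follows. This is a simplified illustration with hypothetical feature names, not the paper's 24-feature set: labelled posts act as the document collection, the new post is the query, and the per-class composition of the top-ranked hits becomes the feature vector.

```python
import math
from collections import Counter

def cosine(q, d):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(q) & set(d)
    num = sum(q[t] * d[t] for t in common)
    den = (math.sqrt(sum(v * v for v in q.values()))
           * math.sqrt(sum(v * v for v in d.values())))
    return num / den if den else 0.0

def ranking_features(post, labelled_posts, k=3):
    """Features derived from the ranking of labelled posts returned for
    the query `post` -- a sketch of the idea, not the paper's exact
    feature set: per class, hits in the top-k and their score mass.
    """
    q = Counter(post.lower().split())
    ranked = sorted(
        ((cosine(q, Counter(text.lower().split())), label)
         for text, label in labelled_posts),
        reverse=True)[:k]
    feats = {}
    for cls in ("pos", "neg"):
        hits = [s for s, lab in ranked if lab == cls]
        feats[f"{cls}_hits"] = len(hits)
        feats[f"{cls}_score"] = sum(hits)
    return feats
```

A downstream classifier then consumes these few dense features instead of a sparse word vector, which is the sparsity-avoidance argument made in the abstract.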
· Felipe N. Flores, Viviane Pereira Moreira. Assessing the impact of Stemming Accuracy on Information Retrieval - A multilingual perspective. Inf. Process. Manage. 52(5): 840-854 (2016). Link to the article
The quality of stemming algorithms is typically measured in two different ways: (i) how accurately they map the variant forms of a word to the same stem; or (ii) how much improvement they bring to Information Retrieval systems. In this article, we evaluate various stemming algorithms, in four languages, in terms of accuracy and in terms of their aid to Information Retrieval. The aim is to assess whether the most accurate stemmers are also the ones that bring the biggest gain in Information Retrieval. Experiments in English, French, Portuguese, and Spanish show that this is not always the case, as stemmers with higher error rates yield better retrieval quality. As a byproduct, we also identified the most accurate stemmers and the best for Information Retrieval purposes.
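The "accuracy" side of the comparison can be illustrated with a toy metric: group the variant forms of each word and check whether a stemmer conflates every group to a single stem. This is a sketch of the notion, not the article's exact metric, and the stemmer below is deliberately crude.

```python
def toy_stemmer(word):
    """A deliberately crude suffix-stripping stemmer (illustrative only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def conflation_accuracy(groups, stem):
    """Fraction of morphological groups whose variants all conflate to a
    single stem -- one simple way to score the 'accuracy' notion used
    when comparing stemmers (a sketch, not the article's exact metric).
    """
    ok = sum(1 for g in groups if len({stem(w) for w in g}) == 1)
    return ok / len(groups)
```

Under this toy metric, {connect, connected, connecting} conflates cleanly while {run, running} does not, so the stemmer scores 0.5; the article's point is that a stemmer scoring lower here can still help retrieval more.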
· Gustavo Zanini Kantorski, Viviane Pereira Moreira, Carlos Alberto Heuser. Automatic Filling of Hidden Web Forms: A Survey. SIGMOD Record 44(1): 24-35 (2015). Link to the article
A significant part of the information available on the Web is stored in online databases which compose what is known as the Hidden Web or Deep Web. In order to access information from the Hidden Web, one must fill in an HTML form that is submitted as a query to the underlying database. In recent years, many works have focused on how to automate the process of form filling by creating methods for choosing values to fill the fields in the forms. This is a challenging task since forms may contain fields for which there are no predefined values to choose from. This article presents a survey of methods for Web Form Filling, analyzing the existing solutions with respect to the type of forms that they handle and the filling strategy adopted. We provide a comparative analysis of 15 key works in this area and discuss directions for future research.
· Edson R. D. Weren, Anderson U. Kauer, Lucas Mizusaki, Viviane Pereira Moreira, José Palazzo Moreira de Oliveira, Leandro Krug Wives: Examining Multiple Features for Author Profiling. JIDM 5(3): 266-279 (2014). Link to the article
Authorship analysis aims at classifying texts based on the stylistic choices of their authors. The idea is to discover characteristics of the authors of the texts. This task has a growing importance in forensics, security, and marketing. In this work, we focus on discovering age and gender from blog authors. With this goal in mind, we analyzed a large number of features – ranging from Information Retrieval to Sentiment Analysis. This paper reports on the usefulness of these features. Experiments on a corpus of over 236K blogs show that a classifier using the features explored here has outperformed the state-of-the-art. More importantly, the experiments show that the Information Retrieval features proposed in our work are the most discriminative and yield the best class predictions.
· Moraes, Mauricio C. ; Heuser, Carlos A. ; Moreira, Viviane P. ; Barbosa, Denilson. Prequery Discovery of Domain-Specific Query Forms: A Survey. IEEE Transactions on Knowledge and Data Engineering (Print), v. 25, p. 1830-1848, 2013. Link to the article
The discovery of HTML query forms is one of the main challenges in Deep Web crawling. Automatic solutions for this problem perform two main tasks. The first is locating HTML forms on the Web, which is done through the use of traditional/focused crawlers. The second is identifying which of these forms are indeed meant for querying, which also typically involves determining a domain for the underlying data source (and thus for the form as well). This problem has attracted a great deal of interest, resulting in a long list of algorithms and techniques. Some methods submit requests through the forms and then analyze the data retrieved in response, typically requiring a great deal of knowledge about the domain as well as semantic processing. Others do not employ form submission, to avoid such difficulties, although some techniques rely to some extent on semantics and domain knowledge. This survey gives an up-to-date review of methods for the discovery of domain-specific query forms that do not involve form submission. We detail these methods and discuss how form discovery has become increasingly more automated over time. We conclude with a forecast of what we believe are the immediate next steps in this trend.
Wikipedia is a public encyclopedia composed of millions of articles written daily by volunteer authors from different regions of the world. The articles contain links called cross-language links which relate corresponding articles across different languages. This feature is extremely useful for applications that work with automatic translation and multilingual information retrieval as it allows the assembly of comparable corpora. Thus, it is important to have a mechanism that automatically creates such links. This has motivated the development of techniques to identify missing cross-language links. In this article, we present CLLFinder, an approach for finding missing cross-language links. The approach makes use of the links between categories and of the transitivity between existing cross-language links, as well as textual features extracted from the articles. Experiments using one million articles from the English and Portuguese Wikipedias attest to the viability of CLLFinder. The results show that our approach has a recall of 96% and a precision of 98%, outperforming the baseline system, even though we employ simpler and fewer features.
· Thanh Nguyen, Viviane Moreira, Huong Nguyen, Hoa Nguyen, Juliana Freire: Multilingual Schema Matching for Wikipedia Infoboxes. Proceedings of the VLDB Endowment, vol. 5, No. 2, p. 133-144, 2011. Link to the article
Recent research has taken advantage of Wikipedia’s multilingualism as a resource for cross-language information retrieval and machine translation, as well as proposed techniques for enriching its cross-language structure. The availability of documents in multiple languages also opens up new opportunities for querying structured Wikipedia content, and in particular, to enable answers that straddle different languages. As a step towards supporting such queries, in this paper, we propose a method for identifying mappings between attributes from infoboxes that come from pages in different languages. Our approach finds mappings in a completely automated fashion. Because it does not require training data, it is scalable: not only can it be used to find mappings between many language pairs, but it is also effective for languages that are under-represented and lack sufficient training samples. Another important benefit of our approach is that it does not depend on syntactic similarity between attribute names, and thus, it can be applied to language pairs that have distinct morphologies. We have performed an extensive experimental evaluation using a corpus consisting of pages in Portuguese, Vietnamese, and English. The results show that not only does our approach obtain high precision and recall, but it also outperforms state-of-the-art techniques. We also present a case study which demonstrates that the multilingual mappings we derive lead to substantial improvements in answer quality and coverage for structured queries over Wikipedia content.
Several advanced data management applications, such as data integration, data deduplication, and similarity querying rely on the application of similarity functions. A similarity function requires the definition of a threshold value in order to decide whether two different data instances match, i.e., if they represent the same real world object. In this context, threshold definition is a central problem. This paper proposes a method for estimating the quality of a similarity function. Quality is measured in terms of recall and precision calculated at several different thresholds. Based on the results of the proposed estimation process and the requirements of a specific application, a user is able to choose a suitable threshold value. The estimation process is based on a clustering phase performed over a data collection (or a sample thereof) and requires no human intervention since the choice of similarity threshold is based on the silhouette coefficient, which is an internal quality measure for clusters. An extensive set of experiments on artificial and real datasets demonstrates the effectiveness of the proposed approach. The results of the experiments show that in most cases the estimation error was below 10% in terms of precision and recall.
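The key ingredient of the estimation process above is the silhouette coefficient. A minimal implementation is sketched below (assuming a clustering with at least two clusters); in the approach described, one would compute it for the clustering induced by each candidate threshold and keep the threshold whose clustering scores highest.

```python
def silhouette(labels, dist):
    """Mean silhouette coefficient of a clustering.

    `labels[i]` is the cluster id of item i, `dist(i, j)` a distance.
    Singletons score 0 by the usual convention; at least two clusters
    are assumed.  A sketch of the internal quality measure used to
    pick similarity thresholds without human intervention.
    """
    n = len(labels)
    clusters = {c: [j for j in range(n) if labels[j] == c]
                for c in set(labels)}
    scores = []
    for i in range(n):
        own = [j for j in clusters[labels[i]] if j != i]
        if not own:
            scores.append(0.0)          # singleton cluster
            continue
        a = sum(dist(i, j) for j in own) / len(own)           # cohesion
        b = min(sum(dist(i, j) for j in members) / len(members)
                for c, members in clusters.items()
                if c != labels[i])                            # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / n
```

With distance defined as 1 minus the similarity score, a well-chosen threshold produces tight, well-separated clusters and hence a high mean silhouette.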
· Dorneles, Carina F. ; Nunes, Marcos Freitas ; Heuser, Carlos A. ; Moreira, Viviane P. ; da Silva, Altigran S. ; de Moura, Edleno S. . A strategy for allowing meaningful and comparable scores in approximate matching. Information Systems, v. 34, p. 673-689, 2009. Link to the article
Approximate data matching aims at assessing whether two distinct instances of data represent the same real-world object. The comparison between data values is usually done by applying a similarity function which returns a similarity score. If this score surpasses a given threshold, both data instances are considered as representing the same real-world object. These score values depend on the algorithm that implements the function and have no meaning to the user. In addition, score values generated by different functions are not comparable. This will potentially lead to problems when the scores returned by different similarity functions need to be combined for computing the similarity between records. In this article, we propose that thresholds should be defined in terms of the precision that is expected from the matching process rather than in terms of the raw scores returned by the similarity function. Precision is a widely known similarity metric and has a clear interpretation from the user's point of view. Our approach defines mappings from score values to precision values, which we call adjusted scores. In order to obtain such mappings, our approach requires training over a small dataset. Experiments show that training can be reused for different datasets on the same domain. Our results also demonstrate that existing methods for combining scores for computing the similarity between records may be enhanced if adjusted scores are used.
Keywords: Similarity querying; Data integration; Data cleaning; Entity resolution; Deduplication
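The score-to-precision mapping can be sketched as follows, assuming a small labelled sample of pairs: for each observed raw score, the adjusted score is the precision obtained when that score is used as the match threshold. The paper's actual mapping procedure may differ in detail.

```python
def adjusted_scores(training_pairs):
    """Map raw similarity scores to precision values ('adjusted scores').

    `training_pairs` is a list of (score, is_match) from a small labelled
    sample.  For each observed score s, the adjusted score is the
    precision of declaring a match at threshold s, i.e. the fraction of
    true matches among pairs scoring >= s.  A sketch of the idea, not
    the paper's exact procedure.
    """
    pairs = sorted(training_pairs, reverse=True)  # highest score first
    mapping, matches = {}, 0
    for rank, (score, is_match) in enumerate(pairs, start=1):
        matches += is_match
        mapping[score] = matches / rank
    return mapping
```

Because adjusted scores from different similarity functions all live on the same precision scale, they can be combined meaningfully when computing record-level similarity.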
We introduce a problem called maximum common characters in blocks (MCCB), which arises in applications of approximate string comparison, particularly in the unification of possibly erroneous textual data coming from different sources. We show that this problem is NP-complete, but can nevertheless be solved satisfactorily using integer linear programming for instances of practical interest. Two integer linear formulations are proposed and compared in terms of their linear relaxations. We also compare the results of the approximate matching with other known measures such as the Levenshtein (edit) distance.
Keywords: Approximate string matching; Distance metric; Block edits; Block moves; Integer programming; Hop constraints
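For reference, the Levenshtein (edit) distance used above as a comparison measure is the classic dynamic program below, computed row by row with unit costs.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance (insert/delete/substitute,
    unit costs) -- the baseline measure the MCCB results are compared with."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]
```

Unlike MCCB, this measure charges each character edit individually and has no notion of block moves, which is precisely what makes the block-based comparison interesting.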
This paper presents a method for assessing the quality of similarity functions. The scenario taken into account is that of approximate data matching, in which it is necessary to determine whether two data instances represent the same real world object. Our method is based on the semi-automatic estimation of optimal threshold values. We propose two methods for performing such estimation. The first method is an algorithm based on a reward function, and the second is a statistical method. Experiments were carried out to validate the techniques proposed. The results show that both methods for threshold estimation produce similar results. The output of such methods was used to design a grading function for similarity functions. This grading function, called discernability, was used to compare a number of similarity functions applied to an experimental data set.
Keywords: Approximate data matching; Similarity functions; Retrieval evaluation
This paper presents a study of relevance feedback in a cross-language information retrieval environment. We have performed an experiment in which Portuguese speakers are asked to judge the relevance of English documents, documents hand-translated to Portuguese, and documents automatically translated to Portuguese. The goals of the experiment were to answer two questions: (i) how well can native Portuguese searchers recognise relevant documents written in English, compared to documents that are hand-translated and automatically translated to Portuguese; and (ii) what is the impact of misjudged documents on the performance improvement that can be achieved by relevance feedback. Surprisingly, the results show that machine translation is as effective as hand translation in aiding users to assess relevance in the experiment. In addition, the impact of misjudged documents on the performance of relevance feedback is overall just moderate, and varies greatly for different query topics.
Keywords: Cross-language information retrieval; Relevance feedback
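As background, the classic relevance-feedback update is Rocchio's formula; the sketch below applies it to {term: weight} vectors. The paper studies the behaviour of relevance feedback under misjudged documents rather than this particular formula, so treat it purely as an illustration of how judged documents reshape the query.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio relevance-feedback update:

        q' = alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant)

    Vectors are {term: weight} dicts; negative weights are clipped to 0.
    An illustration of relevance feedback, not the paper's exact setup.
    """
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    q_new = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        if w > 0:
            q_new[t] = w
    return q_new
```

A misjudged document ends up in the wrong centroid, pulling the updated query toward (or away from) the wrong region of term space, which is exactly the effect the experiment quantifies.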
Simulated networks of spiking leaky integrators are used for categorisation and for Information Retrieval (IR). Neurons in the network are sparsely connected, learn using Hebbian learning rules, and are simulated in discrete time steps. Our earlier work has used these models to simulate human concept formation and usage, but we were interested in the model's applicability to real world problems, so we have done experiments on categorisation and IR. The results of the system show that congresspeople are correctly categorised 89% of the time. The IR systems have 40% average precision on the Time collection, and 28% on the Cranfield 1,400. All scores are comparable to the state of the art results on these tasks.
Keywords: Information retrieval - Categorisation - Neural network - Cell assembly - Hebbian learning
· Orengo, V.M. and D. Santos, Diana. Radicalizadores versus analisadores morfológicos: Sobre a participação do Removedor de Sufixos da Língua Portuguesa nas Morfolimpíadas, in Avaliação conjunta: um novo paradigma no processamento computacional da língua portuguesa. Lisboa, D. Santos, Editor. 2007, IST Press: Lisboa. ISBN:978-972-8469-60-8
· SUAREZ, D. ; MOREIRA, V. P. . Identifying Sentiment-Based Contradictions. In: Simpósio Brasileiro de Bancos de Dados, 2016, Salvador. Proceedings of the 31st Brazilian Symposium on Databases, 2016. p. 76-87. Link to the article
Contradiction Analysis is a relatively new multidisciplinary and complex area with the main goal of identifying contradictory pieces of text. It can be addressed from the perspectives of different research areas such as Natural Language Processing, Opinion Mining, Information Retrieval, and Information Extraction. This paper focuses on the problem of detecting sentiment-based contradictions which occur in the sentences of a given review text. Unlike other types of contradictions, the detection of sentiment-based contradictions can be tackled as a post-processing step in the traditional sentiment analysis task. In this context, we adapted and extended an existing contradiction analysis framework by filtering its results to remove the reviews that are erroneously labelled as contradictory. The filtering method is based on two simple term similarity algorithms. An experimental evaluation on real product reviews has shown proportional improvements of up to 30% in classification accuracy and 26% in the precision of contradiction detection.
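The filtering step can be sketched with one plausible "simple term similarity" measure — Jaccard overlap between the two sentences' term sets. This is an assumption for illustration; the abstract does not name the two algorithms actually used. The intuition: candidate pairs that share too few terms are unlikely to be genuine sentiment contradictions about the same aspect and can be discarded.

```python
def jaccard(a, b):
    """Jaccard overlap between the term sets of two sentences."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_contradictions(pairs, min_overlap=0.2):
    """Keep only candidate contradictory sentence pairs whose term sets
    overlap enough to plausibly discuss the same aspect.  A hypothetical
    stand-in for the paper's two term-similarity filters.
    """
    return [(s1, s2) for s1, s2 in pairs if jaccard(s1, s2) >= min_overlap]
```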
· KAUER, A. ; Moreira, Viviane P. . UFRGS: Identifying Categories and Targets in Customer Reviews. In: International Workshop on Semantic Evaluations (SemEval), 2015, Denver, Colorado. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015. Link to the article
This paper reports on our participation in SemEval-2015 Task 12, which was devoted to Aspect-Based Sentiment Analysis. Participants were required to identify the category (entity and attribute), the opinion target, and the polarity of customer reviews. The system we built relies on classification algorithms to identify aspect categories and on a set of rules to identify the opinion target. We propose a two-phase classification approach for category identification and use a simple method for polarity detection. Our results outperform the baseline in many cases, which means our system could be used as an alternative for aspect classification.
· Alan Souza, Viviane Pereira Moreira, Carlos Alberto Heuser: ARCTIC: metadata extraction from scientific papers in pdf using two-layer CRF. ACM Symposium on Document Engineering 2014: 121-130. Link to the article
Most scientific articles are available in PDF format. The PDF standard allows the generation of metadata that is included within the document. However, many authors do not define this information, making this feature unreliable or incomplete. This fact has been motivating research which aims to extract metadata automatically. Automatic metadata extraction has been identified as one of the most challenging tasks in document engineering. This work proposes Artic, a method for metadata extraction from scientific papers which employs a two-layer probabilistic framework based on Conditional Random Fields. The first layer aims at identifying the main sections with metadata information, and the second layer finds, for each section, the corresponding metadata. Given a PDF file containing a scientific paper, Artic extracts the title, author names, emails, affiliations, and venue information. We report on experiments using 100 real papers from a variety of publishers. Our results outperformed the state-of-the-art system used as the baseline, achieving a precision of over 99%.
· Bruno Laranjeira, Viviane Pereira Moreira, Aline Villavicencio, Carlos Ramisch, Maria José Finatto: Comparing the Quality of Focused Crawlers and of the Translation Resources Obtained from them. LREC 2014: 3572-3578. Link to the article
Comparable corpora have been used as an alternative to parallel corpora as resources for computational tasks that involve domain-specific natural language processing. One way to gather documents related to a specific topic of interest is to traverse a portion of the web graph in a targeted way, using focused crawling algorithms. In this paper, we compare several focused crawling algorithms by using them to collect comparable corpora on a specific domain. We then compare the evaluation of the focused crawling algorithms with the performance of linguistic processes trained on the corpora they generated. We also propose a novel approach for focused crawling, exploiting the expressive power of multiword expressions.
Research in external plagiarism detection is mainly concerned with the comparison of the textual contents of a suspicious document against the contents of a collection of original documents. More recently, methods that try to detect plagiarism based on citation patterns have been proposed. These methods are particularly useful for detecting plagiarism in scientific publications. In this work, we assess the value of identifying co-occurrences in citations by checking whether this method can identify cases of plagiarism in a dataset of scientific papers. Our results show that most of the cases in which co-occurrences were found indeed correspond to plagiarised passages.
The state-of-the-art in domain-specific Web form discovery relies on supervised methods requiring substantial human effort in providing training examples, which limits their applicability in practice. This paper proposes an effective alternative to reduce the human effort: obtaining high-quality domain-specific training forms. In our approach, the only user input is the domain of interest; we use a search engine and a focused crawler to locate query forms which are fed as training data into supervised form classifiers. We tested this approach thoroughly, using thousands of real Web forms from six domains, including a representative subset of a publicly available form base. The results reported in this paper show that it is feasible to mitigate the demanding manual work required by some methods of the current state-of-the-art in form discovery, at the cost of a negligible loss in effectiveness.
· Gustavo Zanini Kantorski, Tiago Guimaraes Moraes, Viviane Pereira Moreira, Carlos Alberto Heuser: Choosing Values for Text Fields in Web Forms. ADBIS (2) 2012: 125-136. Link to the article
Since the only way to gain access to Hidden Web data is through form submission, one of the challenges is how to fill Web forms automatically. In this paper, we propose algorithms which address this challenge. We describe an efficient method to select good values for text fields and a technique which minimizes the number of form submissions and simultaneously maximizes the number of rows retrieved from the underlying database. Experiments using real Web forms show the advantages of our proposed approaches.
· Thanh Hoang Nguyen, Huong Dieu Nguyen, Viviane Moreira, Juliana Freire: Clustering Wikipedia infoboxes to discover their types. CIKM 2012: 2134-2138. Link to the article
Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriate categories and provide infoboxes that follow pre-defined templates, many do not follow the guidelines or follow them loosely. This leads to undesirable effects, such as template duplication, heterogeneity, and schema drift. As a step towards addressing this problem, we propose a new unsupervised approach for clustering Wikipedia infoboxes. Instead of relying on manually assigned categories and template labels, we use the structured information available in infoboxes to group them and infer their entity types. Experiments using over 48,000 infoboxes indicate that our clustering approach is effective and produces high-quality clusters.
· Solange de L. Pertile, Viviane Pereira Moreira: A Test Collection to Evaluate Plagiarism by Missing or Incorrect References. CLEF 2012: 141-143. Link to the article
In recent years, several methods and tools have been developed, together with test collections, to aid in plagiarism detection. However, both methods and collections have focused on content analysis, overlooking citation analysis. In this paper, we aim at filling this gap and present a test collection with cases of plagiarism by missing and incorrect references. The collection contains automatically generated academic papers in which passages from other documents have been inserted. Such passages were either: adequately referenced (i.e., not plagiarized), not referenced, or incorrectly referenced. Annotation files identifying each passage enable the evaluation of plagiarism detection systems.
One of the main tasks in Information Retrieval is to match a user query to the documents that are relevant for it. This matching is challenging because in many cases the keywords the user chooses will be different from the words the authors of the relevant documents have used. Throughout the years, many approaches have been proposed to deal with this problem. One of the most popular consists in expanding the query with related terms with the goal of retrieving more relevant documents. In this paper, we propose a new method in which a Cell Assembly model is applied for query expansion. Cell Assemblies are reverberating circuits of neurons that can persist long after the initial stimulus has ceased. They learn through Hebbian Learning rules and have been used to simulate the formation and the usage of human concepts. We adapted the Cell Assembly model to learn relationships between the terms in a document collection. These relationships are then used to augment the original queries. Our experiments use standard Information Retrieval test collections and show that some queries significantly improved their results with our technique.
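The expansion step can be sketched with a plain co-occurrence model standing in for the learned Cell Assembly associations (an illustrative simplification, not the paper's neural model): term associations are counted from the collection, and the strongest neighbours of the query terms are appended to the query.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(docs):
    """Count how often each pair of terms co-occurs in the same document."""
    counts = defaultdict(int)
    for doc in docs:
        for a, b in combinations(sorted(set(doc.lower().split())), 2):
            counts[(a, b)] += 1
    return counts

def expand_query(query_terms, counts, top=2):
    """Augment the query with its most strongly associated terms -- a plain
    co-occurrence stand-in for the learned Cell Assembly associations."""
    scores = defaultdict(int)
    for (a, b), c in counts.items():
        if a in query_terms and b not in query_terms:
            scores[b] += c
        elif b in query_terms and a not in query_terms:
            scores[a] += c
    extra = sorted(scores, key=lambda t: (-scores[t], t))[:top]
    return list(query_terms) + extra
```

In the paper, the association strengths come from Hebbian learning over reverberating assemblies rather than raw co-occurrence counts, but the way expansions feed back into retrieval is the same.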
· O. C. ; VILLAVICENCIO, A. ; Moreira, Viviane P. . Identification and Treatment of Multiword Expressions applied to Information Retrieval. In: Multiword Expressions: from Parsing and Generation to the Real World (MWE
The extensive use of Multiword Expressions (MWE) in natural language texts prompts more detailed studies that aim for a more adequate treatment of these expressions. An MWE typically expresses concepts and ideas that usually cannot be expressed by a single word. Intuitively, with the appropriate treatment of MWEs, the results of an Information Retrieval (IR) system could be improved. The aim of this paper is to apply techniques for the automatic extraction of MWEs from corpora to index them as a single unit. Experimental results show improvements in the retrieval of relevant documents when identifying MWEs and treating them as a single indexing unit.
· R. C. ; MOREIRA, V. P. ; GALANTE, R. . A New Approach for Cross-Language Plagiarism Analysis. In: CLEF 2010 Conference on Multilingual and Multimodal Information Access Evaluation, 2010,
This paper presents a new method for Cross-Language Plagiarism Analysis. Our task is to detect the plagiarized passages in the suspicious documents and their corresponding fragments in the source documents. We propose a plagiarism detection method composed of five main phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. To evaluate our method, we created a corpus containing artificial plagiarism offenses. Two different experiments were conducted; the first one considers only monolingual plagiarism cases, while the second one considers only cross-language plagiarism cases. The results showed that the cross-language experiment achieved 86% of the performance of the monolingual baseline. We also analyzed how the plagiarized text length affects the overall performance of the method. This analysis showed that our method achieved better results with medium and large plagiarized passages.
· R. C. ; MOREIRA, V. P. ; GALANTE, R. . UFRGS@PAN2010: Detecting External Plagiarism. In: 2010 Lab on Uncovering Plagiarism, Authorship, and Social Software Misuse, 2010,
This paper presents our approach to detecting plagiarism in the PAN'10 competition. To accomplish this task, we applied a method which aims at detecting external plagiarism cases. The method is specially designed to detect cross-language plagiarism and is composed of five phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. Our group took seventh place in the competition with an overall score of 0.5175. Note that the final score was affected by our low recall (0.4036), which arose as a result of not detecting intrinsic plagiarism cases, which were also present in the competition corpus.
· FLORES, F. ; MOREIRA, Viviane P. ; Heuser, C. A. . Assessing the Impact of Stemming Accuracy on Information Retrieval. In: International Conference on Computational Processing of Portuguese Language (PROPOR 2010), 2010. p. 11-20. Link para o artigo
The quality of stemming algorithms is typically measured in two different ways: (i) how accurately they map the variant forms of a word to the same stem; or (ii) how much improvement they bring to Information Retrieval. In this paper, we evaluate different Portuguese stemming algorithms in terms of accuracy and in terms of their aid to Information Retrieval. The aim is to assess whether the most accurate stemmers are also the ones that bring the biggest gain in Information Retrieval. Our results show that some kind of correlation does exist, but it is not as strong as one might have expected.
· GERALDO, A. P. ; MOREIRA, Viviane P. ; GONCALVES, M. A. . On-demand Associative Cross-Language Information Retrieval. In: SPIRE, 2009, Saariselka. Proceedings of SPIRE 2009 (LNCS 5721), 2009. p. 165-173. Link para o artigo
This paper proposes the use of algorithms for mining association rules as an approach for Cross-Language Information Retrieval. These algorithms have been widely used to analyse market basket data. The idea is to map the problem of finding associations between sales items to the problem of finding term translations over a parallel corpus. The proposal was validated by means of experiments using queries in two distinct languages, Portuguese and Finnish, to retrieve documents in English. The results show that the performance of our proposed approach is comparable to the performance of the monolingual baseline and to query translation via machine translation, even though these systems employ more complex Natural Language Processing techniques. The combination of machine translation and our approach yielded the best results, even outperforming the monolingual baseline.
Keywords: association rules, experimentation, performance measurement
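The basket-to-translation mapping described above can be sketched as follows, under stated assumptions: a toy parallel corpus, whitespace tokenization, and illustrative function names and thresholds (not the paper's actual implementation). Each aligned sentence pair plays the role of a market-basket transaction, and a rule s → t is kept when the pair clears minimum support and confidence.

```python
from collections import Counter
from itertools import product

def mine_translation_rules(parallel_pairs, min_support=2, min_confidence=0.5):
    """Mine source->target term association rules from aligned sentence pairs."""
    src_counts = Counter()   # how many pairs contain each source term
    pair_counts = Counter()  # how many pairs contain each (source, target) combination
    for src_sent, tgt_sent in parallel_pairs:
        src_terms, tgt_terms = set(src_sent.split()), set(tgt_sent.split())
        src_counts.update(src_terms)
        pair_counts.update(product(src_terms, tgt_terms))
    rules = {}
    for (s, t), co in pair_counts.items():
        conf = co / src_counts[s]  # confidence of the rule s -> t
        if co >= min_support and conf >= min_confidence:
            # keep only the highest-confidence translation for each source term
            if s not in rules or conf > rules[s][1]:
                rules[s] = (t, conf)
    return {s: t for s, (t, _) in rules.items()}

# Hypothetical three-pair Portuguese-English parallel corpus
corpus = [
    ("casa grande", "big house"),
    ("casa pequena", "small house"),
    ("cidade grande", "big city"),
]
rules = mine_translation_rules(corpus)
```

With this toy corpus, `casa` maps to `house` and `grande` to `big`; terms seen in only one pair fall below the support threshold and produce no rule.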
· A. P. ; ORENGO, Viviane Moreira . UFRGS@CLEF2008: Using Association Rules for Cross-Language Information Retrieval. In: Cross-Language Evaluation Forum (CLEF), 2008.
For UFRGS’s participation in the TEL task at CLEF 2008, our aim was to assess the validity of using algorithms for mining association rules to find mappings between concepts in a Cross-Language Information Retrieval scenario. Our approach requires a sample of parallel documents to serve as the basis for the generation of the association rules. The results of the experiments show that the performance of our approach is not statistically different from the monolingual baseline in terms of mean average precision. This is an indication that association rules can be effectively used to map concepts between languages. We also tested a modification to BM25 that aims at increasing the weight of rare terms. The results show that this modified version achieved better performance; the improvements were statistically significant in terms of MAP on our monolingual runs.
Keywords: association rules, experimentation, performance measurement
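The entry above mentions a BM25 variant that boosts rare terms. The abstract does not detail the modification, so the sketch below shows standard Okapi BM25 only; its IDF factor is the component that already favors rare terms and that such a modification would scale further. The function name and the toy corpus are illustrative, not from the paper.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Standard Okapi BM25 score of one document (a list of terms) for a query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        # IDF: large for rare terms, small for common ones
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(term)
        # term-frequency saturation with document-length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

docs = [["cross", "language", "retrieval"],
        ["language", "model"],
        ["association", "rules", "mining"]]
s = bm25_score(["language", "retrieval"], docs[0], docs)
```

Here the rare term "retrieval" (appearing in one document) contributes roughly twice the weight of the more common "language", which is exactly the behavior a rare-term boost would amplify.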
· O. C. ; ORENGO, Viviane Moreira ; VILLAVICENCIO, A. . UFRGS@CLEF2008: Indexing Multiword Expressions for Information Retrieval. In: Cross-Language Evaluation Forum (CLEF), 2008.
For UFRGS’s participation in CLEF’s Robust task, our aim was to assess the benefits of identifying and indexing Multiword Expressions (MWEs) for Information Retrieval. The approach used for MWE identification was purely statistical, based on association measures such as Mutual Information and Chi-square. Contradicting our results on the training topics, the results on the test topics did not show any significant improvements. However, for some queries, the identification of MWEs was very important. We also performed bilingual experiments, which achieved 84% of the performance of their monolingual counterparts.
Keywords: experimentation, performance measurement, multiword expression
· SANTOS, J. B. ; Heuser, Carlos Alberto ; ORENGO, Viviane Moreira ; WIVES, L. K. . Automatic Threshold Estimation for Data Matching Applications. In: Simpósio Brasileiro de Banco de Dados, 2008, Campinas. Anais do SBBD 2008, 2008. Link para o artigo
Several advanced data management applications, such as data integration, data deduplication, or similarity querying, rely on the application of similarity functions. A similarity function requires the definition of a threshold value to assess whether two data instances match, i.e., whether they represent the same real-world object. In this context, defining the threshold is a central problem. In this paper, we propose a method for estimating the quality of a similarity function. Quality is measured in terms of recall and precision calculated at several different thresholds. On the basis of the results of the proposed estimation process, and taking into account the requirements of a specific application, a user is able to choose a threshold value adequate for that application. The proposed estimation process is based on a clustering phase performed on a sample taken from a data collection and requires no human intervention.
Keywords: similarity querying, clustering
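The estimation described above — recall and precision computed at several candidate thresholds — can be sketched as follows. For illustration, the match labels are passed in directly; in the paper they come from an unsupervised clustering phase over a sample, with no human intervention. The function name is hypothetical.

```python
def precision_recall_curve(scored_pairs, thresholds):
    """Compute (precision, recall) at each candidate threshold.

    scored_pairs: list of (similarity_score, is_true_match) tuples.
    """
    total_true = sum(1 for _, m in scored_pairs if m)
    curve = {}
    for t in thresholds:
        # pairs predicted as matches at this threshold
        predicted = [(s, m) for s, m in scored_pairs if s >= t]
        tp = sum(1 for _, m in predicted if m)
        precision = tp / len(predicted) if predicted else 1.0
        recall = tp / total_true if total_true else 0.0
        curve[t] = (precision, recall)
    return curve

# Toy scored pairs: (similarity score, true-match label)
pairs = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.30, False)]
curve = precision_recall_curve(pairs, [0.5, 0.8])
```

Raising the threshold from 0.5 to 0.8 on this toy data trades recall (1.0 down to 2/3) for precision (0.75 up to 1.0), which is exactly the trade-off the user inspects to pick a threshold for the application.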
· C. F. ; Heuser, Carlos Alberto ; Orengo, Viviane Moreira ; Silva, A. S. ; Moura, E. S. . A Strategy for Allowing Meaningful and Comparable Scores in Approximate Matching. In: CIKM - Conference on Information and Knowledge Management, 2007.
The goal of approximate data matching is to assess whether two distinct data instances represent the same real-world object. This is usually achieved through the use of a similarity function, which returns a score that defines how similar two data instances are. If this score surpasses a given threshold, both data instances are considered as representing the same real-world object. The score values returned by a similarity function depend on the algorithm that implements the function and have no meaning to the user (apart from the fact that a higher similarity value means that two data instances are more similar). In this paper, we propose that instead of defining the threshold in terms of the scores returned by a similarity function, the user specifies the precision that is expected from the matching process. Precision is a well-known quality measure and has a clear interpretation from the user's point of view. Our approach relies on a mapping between similarity scores and precision values based on a training data set. Experimental results show that the training may be performed on a representative data set and reused for other databases from the same domain.
Keywords: data cleaning - data integration - similarity querying
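A minimal sketch of the inverse lookup this paper proposes: the user states a target precision, and a helper (`threshold_for_precision`, a hypothetical name, not from the paper) returns the lowest training-set score threshold that reaches it. Choosing the lowest qualifying threshold maximizes recall at that precision level.

```python
def threshold_for_precision(training_pairs, target_precision):
    """Return the lowest score threshold whose training precision
    meets the user's target, or None if no threshold does.

    training_pairs: list of (similarity_score, is_true_match) tuples.
    """
    # evaluate precision at each observed score, from lowest to highest
    for t in sorted({s for s, _ in training_pairs}):
        kept = [m for s, m in training_pairs if s >= t]
        precision = sum(kept) / len(kept)
        if precision >= target_precision:
            return t
    return None

# Toy training set of scored, labeled pairs
train = [(0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.2, False)]
t = threshold_for_precision(train, 0.75)
```

On this toy data a target precision of 0.75 maps to threshold 0.5, sparing the user from interpreting raw, function-specific score values.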
· Heuser, Carlos Alberto ; Krieser, F. A. ; Orengo, Viviane Moreira . SimEval - A Tool for Evaluating the Quality of Similarity Functions. In: Tutorials, posters, panels and industrial contributions at the 26th International Conference on Conceptual Modeling - ER 2007. Auckland : ACS, 2007. v. 83. p. 71-76. Link para o artigo
Approximate data matching applications typically use similarity functions to quantify the degree of likeness between two data instances. Since several similarity functions are available, it is often necessary to evaluate a number of them in order to choose the one most adequate for a specific application. This paper presents a tool that uses average precision and discernability to evaluate the quality of similarity functions over a data set.
Keywords: approximate data matching - similarity functions
· V.M., L. Buriol, and A. Coelho. A Study on the Use of Stemming for Monolingual Ad-Hoc Portuguese Information Retrieval, in Evaluation of Multilingual and Multi-modal Information Retrieval, C. Peters, et al., Editors. 2007.
For UFRGS’s first participation in CLEF, our goal was to compare the performance of heavier and lighter stemming strategies using the Portuguese data collections for Monolingual Ad-hoc retrieval. The results show that the safest strategy was to use the lighter alternative (reducing plural forms only). On a query-by-query analysis, full stemming achieved the highest improvement but also the biggest decrease in performance when compared to no stemming. In addition, statistical tests showed that the only significant improvement, both in terms of mean average precision and precision at ten, was achieved by our lighter stemmer.
Keywords: Information retrieval - stemming algorithms - evaluation
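A toy illustration of the "lighter" strategy mentioned above, reducing plural forms only. The rules below are a small illustrative subset chosen for this sketch; the actual stemmer has a much fuller rule set with exception lists.

```python
def reduce_plural(word):
    """Reduce a Portuguese plural form to its singular (toy rule subset)."""
    rules = [
        ("ões", "ão"),   # limões  -> limão
        ("ães", "ão"),   # cães    -> cão
        ("ais", "al"),   # animais -> animal
        ("res", "r"),    # flores  -> flor
        ("s", ""),       # casas   -> casa
    ]
    for suffix, replacement in rules:
        # minimum-length guard keeps short words like "mês" untouched
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word
```

For example, `reduce_plural("animais")` yields `"animal"` and `reduce_plural("casas")` yields `"casa"`, while words not matching any rule pass through unchanged.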
· V.M. and C.R. Huyck, Portuguese-English Cross-Language Information Retrieval Using Latent Semantic Indexing, in Advances in Cross-Language Information Retrieval - Third Workshop of the Cross-Language Evaluation Forum, CLEF 2002 (LNCS 2785), C. Peters, et al., Editors. 2003, Springer.
This paper reports the work of
Keywords: cross-language information retrieval, stemming algorithms, latent semantic indexing
· and C.R. Huyck, A Stemming Algorithm for the Portuguese Language, in 8th International Symposium on String Processing and Information Retrieval (SPIRE).
Stemming algorithms are traditionally used in Information Retrieval with the goal of enhancing recall, as they conflate the variant forms of a word into a common representation. This paper describes the development of a simple and effective suffix-stripping algorithm for Portuguese. The stemmer is evaluated using a method proposed by Paice. The results show that it performs significantly better than the Portuguese version of the Porter algorithm.
Keywords: stemming algorithms, evaluation
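Paice's evaluation method, used in the entry above, counts two kinds of stemming errors: understemming (words of the same concept group mapped to different stems) and overstemming (words of different groups mapped to the same stem). A simplified, unweighted pairwise version is sketched below; the helper name and data are illustrative, and the full method additionally computes weighted indices from these counts.

```python
from itertools import combinations

def paice_counts(groups, stem):
    """Count understemming and overstemming errors over concept groups."""
    under = over = 0
    # understemming: same concept group, different stems
    for group in groups:
        for a, b in combinations(group, 2):
            if stem(a) != stem(b):
                under += 1
    # overstemming: different groups, same stem
    for g1, g2 in combinations(groups, 2):
        for a in g1:
            for b in g2:
                if stem(a) == stem(b):
                    over += 1
    return under, over

groups = [["connect", "connected", "connecting"], ["contest", "contested"]]
crude = lambda w: w[:3]  # deliberately crude stemmer: truncate to 3 letters
under, over = paice_counts(groups, crude)
```

The crude truncation stemmer conflates every word here to "con", so it makes no understemming errors but six overstemming errors; an identity "stemmer" would show the opposite pattern.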
· V.M., Cross-Language Information Retrieval and Digital Libraries, in C@MDX, C. Nielsen and V.M. Orengo, Editors. 2000.
· Moreira, V.P. and N. Edelweiss. Schema Versioning: Queries to the Generalised Temporal Database System. In: INTERNATIONAL WORKSHOP ON SPATIO-TEMPORAL DATA MODELS AND LANGUAGES. 1999. Florence, Italy: IEEE.
· Moreira, V.P. and N. Edelweiss, Queries to Temporal Databases Supporting Schema Versioning, in SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD'99). 1999: Florianópolis. p. 299-313.
· Moreira, V.P. and N. Edelweiss, Versioning and The Generalised Temporal Database, in XXV CONFERENCIA LATINO-AMERICANA DE INFORMATICA - CLEI. 1999: Asunción, Paraguay. p. 111-122.
· V.M., Assessing Relevance Using Automatically Translated Documents for Cross-Language Information Retrieval, in School of Computing Science. 2004.
· Moreira, V.P., Evolução de Esquemas em Bancos de Dados Temporais. 1997, UFRGS: Porto Alegre (Brazil). Trabalho Individual. (In Portuguese)
· Moreira, V.P., Consultas a Bancos de Dados Temporais que suportam Versionamento de Esquemas, in Instituto de Informática. 1999, UFRGS: Porto Alegre (Brazil). Masters dissertation (in Portuguese)
· Moreira, V.P. and N. Edelweiss, Consultas a bancos de dados temporais que suportam versionamento de esquemas, in SEMANA ACADÊMICA DO CPGCC. 1998: Porto Alegre, RS, Brazil (in Portuguese)