Viviane P. Moreira – Publications
· Thanh Nguyen, Viviane Moreira, Huong Nguyen,
Abstract:
Recent research has taken advantage of Wikipedia’s
multilingualism as a resource for cross-language information retrieval and
machine translation, as well as proposed techniques for enriching its
cross-language structure. The availability of documents in multiple languages
also opens up new opportunities for querying structured Wikipedia content, and
in particular, to enable answers that straddle different languages. As a step
towards supporting such queries, in this paper, we propose a method for
identifying mappings between attributes from infoboxes that come from pages in
different languages. Our approach finds mappings in a completely automated
fashion. Because it does not require training data, it is scalable: not only
can it be used to find mappings between many language pairs, but it is also effective
for languages that are under-represented and lack sufficient training samples.
Another important benefit of our approach is that it does not depend on
syntactic similarity between attribute names, and thus, it can be applied to
language pairs that have distinct morphologies. We have performed an extensive
experimental evaluation using a corpus consisting of pages in Portuguese,
Vietnamese, and English. The results show that not only does our approach obtain
high precision and recall, but it also outperforms state-of-the-art techniques.
We also present a case study which demonstrates that the multilingual mappings
we derive lead to substantial improvements in answer quality and coverage for
structured queries over Wikipedia content.
· dos Santos, J.B., C.A. Heuser, V.P. Moreira, and L.K. Wives, Automatic threshold estimation for data matching applications. Information Sciences, v. 181, p. 2685-2699, 2011. Link to the article
Abstract:
Several advanced data management applications, such as
data integration, data deduplication, and similarity querying rely on the
application of similarity functions. A similarity function requires the
definition of a threshold value in order to decide whether two different data
instances match, i.e., if they represent the same real world object. In this
context, threshold definition is a central problem. This paper proposes a
method for estimating the quality of a similarity function. Quality is measured
in terms of recall and precision calculated at several different thresholds.
Based on the results of the proposed estimation process and the requirements of
a specific application, a user is able to choose a suitable threshold value.
The estimation process is based on a clustering phase performed over a data
collection (or a sample thereof) and requires no human intervention since the
choice of similarity threshold is based on the silhouette coefficient, which is
an internal quality measure for clusters. An extensive set of experiments on
artificial and real datasets demonstrates the effectiveness of the proposed
approach. The results of the experiments show that in most cases the estimation
error was below 10% in terms of precision and recall.
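The silhouette coefficient mentioned in this abstract can be illustrated with a minimal sketch. This shows only the standard definition of the measure; how the paper derives a threshold from it is not described here. The function names, the one-dimensional toy data, and the absolute-difference distance are all assumptions made for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def silhouette(clusters, distance=lambda x, y: abs(x - y)):
    """Mean silhouette coefficient over all points.

    `clusters` is a list of non-empty lists of points (assumed format).
    For each point: a = mean distance to the other points in its own cluster,
    b = smallest mean distance to the points of any other cluster,
    and the point's score is (b - a) / max(a, b).
    """
    scores = []
    for ci, cluster in enumerate(clusters):
        for pi, point in enumerate(cluster):
            others = [q for qi, q in enumerate(cluster) if qi != pi]
            if not others or len(clusters) < 2:
                # Common convention: singleton clusters score 0.
                scores.append(0.0)
                continue
            a = mean([distance(point, q) for q in others])
            b = min(
                mean([distance(point, q) for q in other])
                for oi, other in enumerate(clusters)
                if oi != ci
            )
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


# Toy data (hypothetical): a tight clustering and a poor one of the same points.
good = [[1.0, 2.0], [10.0, 11.0]]
bad = [[1.0, 10.0], [2.0, 11.0]]
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own, so `silhouette(good)` is positive and `silhouette(bad)` is negative.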
· Dorneles, Carina F.; Nunes, Marcos Freitas; Heuser, Carlos A.; Moreira, Viviane P.; da Silva, Altigran S.; de Moura, Edleno S. A strategy for allowing meaningful and comparable scores in approximate matching. Information Systems, v. 34, p. 673-689, 2009. Link to the article
Abstract:
Approximate data matching aims at assessing whether
two distinct instances of data represent the same real-world object. The
comparison between data values is usually done by applying a similarity function
which returns a similarity score. If this score surpasses a given threshold,
both data instances are considered as representing the same real-world object.
These score values depend on the algorithm that implements the function and
have no meaning to the user. In addition, score values generated by different
functions are not comparable. This will potentially lead to problems when the
scores returned by different similarity functions need to be combined for
computing the similarity between records. In this article, we propose that
thresholds should be defined in terms of the precision that is expected from
the matching process rather than in terms of the raw scores returned by the
similarity function. Precision is a widely known similarity metric and has a
clear interpretation from the user's point of view. Our approach defines
mappings from score values to precision values, which we call adjusted scores.
In order to obtain such mappings, our approach requires training over a small
dataset. Experiments show that training can be reused for different datasets on
the same domain. Our results also demonstrate that existing methods for
combining scores for computing the similarity between records may be enhanced
if adjusted scores are used.
Keywords: Similarity querying; Data integration; Data cleaning;
Entity resolution; Deduplication
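The idea of mapping raw similarity scores to precision values can be sketched as follows. This is one plausible reading of "adjusted scores", not the article's actual procedure: the training-data format (pairs of raw score and match label) and the names `precision_table` and `adjusted_score` are assumptions.

```python
from bisect import bisect_right


def precision_table(training_pairs):
    """Build a sorted list of (threshold, precision) from labelled training pairs.

    `training_pairs` is a list of (raw_score, is_match) tuples (assumed format).
    The precision at threshold t is the fraction of true matches among all
    pairs whose raw score is at least t.
    """
    thresholds = sorted({score for score, _ in training_pairs})
    table = []
    for t in thresholds:
        accepted = [is_match for score, is_match in training_pairs if score >= t]
        table.append((t, sum(accepted) / len(accepted)))
    return table


def adjusted_score(raw_score, table):
    """Look up the precision at the largest training threshold <= raw_score.

    Returns 0.0 when the raw score is below every threshold seen in training.
    """
    thresholds = [t for t, _ in table]
    idx = bisect_right(thresholds, raw_score) - 1
    return table[idx][1] if idx >= 0 else 0.0


# Hypothetical training sample: (raw score, whether the pair is a true match).
training = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False)]
table = precision_table(training)
```

With a table like this, scores produced by different similarity functions are each mapped to the same scale (expected precision), which is what makes them comparable and combinable.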
· RITT, M; COSTA, A; MERGEN, S; ORENGO, V. M. An integer linear programming approach for approximate string comparison. European Journal of Operational Research, 2009. Link to the article
Abstract:
We introduce a problem called maximum common
characters in blocks (MCCB), which arises in applications of approximate string
comparison, particularly in the unification of possibly erroneous textual data
coming from different sources. We show that this problem is NP-complete, but
can nevertheless be solved satisfactorily using integer linear programming for
instances of practical interest. Two integer linear formulations are proposed
and compared in terms of their linear relaxations. We also compare the results
of the approximate matching with other known measures such as the Levenshtein
(edit) distance.
Keywords: Approximate string matching; Distance metric; Block edits; Block moves;
Integer programming; Hop constraints
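For reference, the Levenshtein (edit) distance that this abstract uses as a comparison measure can be computed with the standard dynamic-programming recurrence. The sketch below is the textbook algorithm, not the MCCB formulation proposed in the paper; the function name is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]
```

Because it counts only single-character operations, this measure charges many edits when whole blocks of text are reordered, which is the kind of case that block-based measures are meant to handle.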
· Silva, R.; Stasiu, R.K.; Orengo, V.M. and
Abstract:
This paper presents a method for assessing the quality
of similarity functions. The scenario taken into account is that of approximate
data matching, in which it is necessary to determine whether two data instances
represent the same real world object. Our method is based on the semi-automatic
estimation of optimal threshold values. We propose two methods for performing
such estimation. The first method is an algorithm based on a reward function,
and the second is a statistical method. Experiments were carried out to
validate the techniques proposed. The results show that both methods for
threshold estimation produce similar results. The output of such methods was
used to design a grading function for similarity functions. This grading
function, called discernability, was used to compare a number of similarity
functions applied to an experimental data set.
Keywords: Approximate data matching; Similarity functions; Retrieval evaluation
· Orengo, V.M. and C.R. Huyck, Relevance Feedback and Cross-language Information Retrieval. Information Processing & Management, 2006. 42(5): p. 1203-1217. Link to the article
Abstract:
This paper presents a study of relevance feedback in a
cross-language information retrieval environment. We have performed an
experiment in which Portuguese speakers are asked to judge the relevance of
English documents, documents hand-translated to Portuguese, and documents
automatically translated to Portuguese. The goals of the experiment were to
answer two questions: (i) how well can native Portuguese searchers recognise
relevant documents written in English, compared to documents that are hand
translated and automatically translated to Portuguese; and (ii) what is the
impact of misjudged documents on the performance improvement that can be
achieved by relevance feedback. Surprisingly, the results show that machine
translation is as effective as hand translation in aiding users to assess
relevance in the experiment. In addition, the impact of misjudged documents on
the performance of RF is overall just moderate, and varies greatly for
different query topics.
Keywords: Cross-language information retrieval; Relevance feedback
· Huyck, C.R. and V.M. Orengo, Information Retrieval and Categorisation Using a Cell Assembly Network. Neural Computing and Applications, 2005. 14(4): p. 282-289. Link to the article
Abstract:
Simulated networks of spiking leaky integrators are
used for categorisation and for Information Retrieval (IR). Neurons in the network
are sparsely connected, learn using Hebbian learning rules, and are simulated
in discrete time steps. Our earlier work has used these models to simulate
human concept formation and usage, but we were interested in the model's
applicability to real world problems, so we have done experiments on
categorisation and IR. The results of the system show that congresspeople are
correctly categorised 89% of the time. The IR systems have 40% average
precision on the Time collection, and 28% on the Cranfield 1,400. All scores
are comparable to the state of the art results on these tasks.
Keywords: Information retrieval - Categorisation - Neural network - Cell assembly - Hebbian learning
· Orengo, V.M. and D. Santos. Radicalizadores versus analisadores morfológicos: Sobre a participação do Removedor de Sufixos da Língua Portuguesa nas Morfolimpíadas, in Avaliação conjunta: um novo paradigma no processamento computacional da língua portuguesa, D. Santos, Editor. 2007, IST Press: Lisboa. ISBN: 978-972-8469-60-8
· VOLPE,
Abstract:
One of the main tasks in Information Retrieval is to
match a user query to the documents that are relevant for it. This matching is challenging because in many
cases the keywords the user chooses will be different from the words the
authors of the relevant documents have used. Throughout the years, many
approaches have been proposed to deal with this problem. One of the most
popular consists in expanding the query with related terms with the goal of
retrieving more relevant documents. In
this paper, we propose a new method in which a Cell Assembly model is applied
for query expansion. Cell Assemblies are
reverberating circuits of neurons that can persist long after the initial
stimulus has ceased. They learn through Hebbian learning rules and have been
used to simulate the formation and the usage of human concepts. We adapted the
Cell Assembly model to learn relationships between the terms in a document
collection. These relationships are then used to augment the original queries.
Our experiments use standard Information Retrieval test collections and show
that some queries significantly improved their results with our technique.
· ACOSTA, O. C. ;
Abstract:
The extensive use of Multiword Expressions (MWE) in
natural language texts prompts more detailed studies that aim for a more
adequate treatment of these expressions. An MWE typically expresses concepts and
ideas that usually cannot be expressed by a single word. Intuitively, with the
appropriate treatment of MWEs, the results of an Information Retrieval (IR)
system could be improved. The aim of this paper is to apply techniques for the
automatic extraction of MWEs from corpora to index them as a single unit.
Experimental results show improvements on the retrieval of relevant documents
when identifying MWEs and treating them as a single indexing unit.
· PEREIRA, R. C.; MOREIRA, V.P.; GALANTE, R. A New Approach for Cross-Language Plagiarism Analysis. In: CLEF 2010 Conference on Multilingual and Multimodal Information Access Evaluation, 2010,
Abstract:
This paper presents a new method for Cross-Language
Plagiarism Analysis. Our task is to detect the plagiarized passages in the
suspicious documents and their corresponding fragments in the source documents.
We propose a plagiarism detection method composed of five main phases: language
normalization, retrieval of candidate documents, classifier training,
plagiarism analysis, and post-processing. To evaluate our method, we created a
corpus containing artificial plagiarism offenses. Two different experiments
were conducted; the first one considers only monolingual plagiarism cases,
while the second one considers only cross-language plagiarism cases. The
results showed that the cross-language experiment achieved 86% of the
performance of the monolingual baseline. We also analyzed how the plagiarized
text length affects the overall performance of the method. This analysis showed
that our method achieved better results with medium and large plagiarized
passages.
· PEREIRA, R. C.; MOREIRA, V.P.; GALANTE, R. UFRGS@PAN2010: Detecting External Plagiarism. In: PAN 2010 Lab on Uncovering Plagiarism, Authorship, and Social Software Misuse, 2010,
Abstract:
This paper presents our approach to detect plagiarism
in the PAN’10 competition. To accomplish this task we applied a method which
aims at detecting external plagiarism cases. The method is specifically designed
to detect cross-language plagiarism and is composed of five phases: language
normalization, retrieval of candidate documents, classifier training,
plagiarism analysis, and post-processing. Our group took seventh place in
the competition with an overall score of 0.5175. It is important to note that
the final score was affected by our low recall (0.4036) which arose as a result
of not detecting intrinsic plagiarism cases, which were also present in the
competition corpus.
· FLORES, F.; MOREIRA, Viviane P.; Heuser, C. A. Assessing the Impact of Stemming Accuracy on Information Retrieval. In: International Conference on Computational Processing of Portuguese Language (PROPOR 2010), 2010. p. 11-20. Link to the article
Abstract:
The quality of stemming algorithms is typically
measured in two different ways: (i) how accurately they map the variant forms
of a word to the same stem; or (ii) how much improvement they bring to
Information Retrieval. In this paper, we evaluate different Portuguese stemming
algorithms in terms of accuracy and in terms of their aid to Information
Retrieval. The aim is to assess whether the most accurate stemmers are also the
ones that bring the biggest gain in Information Retrieval. Our results show
that some kind of correlation does exist, but it is not as strong as one might
have expected.
· GERALDO, A. P.; MOREIRA, Viviane P.; GONCALVES, M. A. On-demand Associative Cross-Language Information Retrieval. In: SPIRE, 2009, Saariselka. Proceedings of SPIRE 2009 (LNCS 5721), 2009. p. 165-173. Link to the article
Abstract:
This paper proposes the use of algorithms for mining
association rules as an approach for Cross-Language Information Retrieval. These
algorithms have been widely used to analyse market basket data. The idea is to
map the problem of finding associations between sales items to the problem of
finding term translations over a parallel corpus. The proposal was validated by
means of experiments using queries in two distinct languages: Portuguese and
Finnish to retrieve documents in English. The results show that the performance
of our proposed approach is comparable to the performance of the monolingual
baseline and to query translation via machine translation, even though these
systems employ more complex Natural Language Processing techniques. The
combination of machine translation and our approach yielded the best
results, even outperforming the monolingual baseline.
Keywords: association rules, experimentation, performance measurement
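The mapping from market basket analysis to term translation can be sketched as follows: each aligned sentence pair plays the role of a transaction, and a rule "source term → target term" is scored by support and confidence. This is a simplified illustration, not the paper's algorithm; the function name, the whitespace tokenisation, and the toy corpus are assumptions.

```python
from collections import Counter


def mine_translation_rules(parallel_pairs, min_support=0.0, min_confidence=0.0):
    """Score candidate rules `source_term -> target_term` over aligned pairs.

    support    = fraction of aligned pairs containing both terms
    confidence = fraction of pairs containing the source term that also
                 contain the target term
    Returns {(source_term, target_term): (support, confidence)}.
    """
    n = len(parallel_pairs)
    source_counts = Counter()
    pair_counts = Counter()
    for source_text, target_text in parallel_pairs:
        # Naive tokenisation (assumption): lowercase and split on whitespace.
        source_terms = set(source_text.lower().split())
        target_terms = set(target_text.lower().split())
        source_counts.update(source_terms)
        pair_counts.update((s, t) for s in source_terms for t in target_terms)

    rules = {}
    for (s, t), both in pair_counts.items():
        support = both / n
        confidence = both / source_counts[s]
        if support >= min_support and confidence >= min_confidence:
            rules[(s, t)] = (support, confidence)
    return rules


# Hypothetical toy parallel corpus (Portuguese -> English).
corpus = [
    ("a casa", "the house"),
    ("a casa azul", "the blue house"),
    ("o carro azul", "the blue car"),
]
rules = mine_translation_rules(corpus)
```

On this toy corpus the rule casa → house gets confidence 1.0 while casa → blue gets 0.5. Very frequent target words such as "the" also reach high confidence, so a practical system would likely need additional filtering, which this sketch omits.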
· GERALDO, A. P.; ORENGO, Viviane Moreira. UFRGS@CLEF2008: Using Association Rules for Cross-Language Information Retrieval. In: Cross-Language Evaluation Forum (CLEF), 2008,
Abstract:
For UFRGS’s participation in the TEL task at CLEF2008,
our aim was to assess the validity of using algorithms for mining association
rules to find mappings between concepts in a Cross-Language Information
Retrieval scenario. Our approach requires a sample of parallel documents to
serve as the basis for the generation of the association rules. The results of
the experiments show that the performance of our approach is not statistically
different from the monolingual baseline in terms of mean average precision.
This is an indication that association rules can be effectively used to map
concepts between languages. We have also tested a modification to BM25 that
aims at increasing the weight of rare terms. The results show that this
modified version achieved better performance. The improvements were considered
to be statistically significant in terms of MAP on our monolingual runs.
Keywords: association rules, experimentation, performance measurement
· ACOSTA, O. C. ; ORENGO, Viviane Moreira ;
Abstract:
For UFRGS’s participation in CLEF’s Robust task, our
aim was to assess the benefits of identifying and indexing Multiword
Expressions (MWEs) for Information Retrieval. The approach used for MWE
identification was purely statistical, based on association measures such as
Mutual Information and Chi-square. Contradicting our results on the training topics,
the results on the test topics did not show any significant improvements.
However, for some queries, the identification of MWEs was very important. We
have also performed bilingual experiments, which achieved 84% of the
performance of their monolingual counterparts.
Keywords: experimentation, performance measurement, multiword expression
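As an illustration of the kind of association measure this abstract refers to, the sketch below scores adjacent word pairs by pointwise mutual information (PMI). It is a generic textbook formulation, not the system used in the experiments; the function name and the toy sentence are assumptions, and probabilities are plain relative frequencies.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Pointwise mutual information for each adjacent word pair.

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ), where P(w1, w2) is the
    relative frequency of the bigram and P(w) that of the single word.
    Expects at least two tokens.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical toy text.
tokens = "new york is big and new york is old".split()
scores = bigram_pmi(tokens)
```

Pairs with high scores co-occur more often than their individual frequencies would predict, which makes them candidates for being indexed as a single unit. PMI is known to favour rare words, so implementations commonly combine it with a minimum frequency cut-off.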
· SANTOS, J. B.; HEUSER, Carlos Alberto; ORENGO, Viviane Moreira; WIVES, L. K. Automatic Threshold Estimation for Data Matching Applications. In: Simpósio Brasileiro de Banco de Dados, 2008, Campinas. Anais do SBBD 2008, 2008. Link to the article
Abstract:
Several advanced data management applications, such as
data integration, data deduplication or similarity querying rely on the
application of similarity functions. A similarity function requires the
definition of a threshold value in order to assess if two different data
instances match, i.e., if they represent the same real world object. In this
context, the threshold definition is a central problem. In this paper, we
propose a method for the estimation of the quality of a similarity function.
Quality is measured in terms of recall and precision calculated at several
different thresholds. On the basis of the results of the proposed estimation
process, and taking into account the requirements of a specific application, a
user is able to choose a threshold value that is adequate for the application.
The proposed estimation process is based on a clustering phase performed on a
sample taken from a data collection and requires no human intervention.
Keywords: similarity querying, clustering
· Dorneles, C. F.; Heuser, Carlos Alberto; Orengo, Viviane Moreira; Silva, A. S.; Moura, E. S. A Strategy for Allowing Meaningful and Comparable Scores in Approximate Matching. In: CIKM - Conference on Information and Knowledge Management, 2007,
Abstract:
The goal of approximate data matching is to assess
whether two distinct data instances represent the same real world object. This
is usually achieved through the use of a similarity function, which returns a
score that defines how similar two data instances are. If this score surpasses
a given threshold, both data instances are considered as representing the same
real world object. The score values returned by a similarity function depend on
the algorithm that implements the function and have no meaning to the user
(apart from the fact that a higher similarity value means that two data
instances are more similar). In this paper, we propose that instead of defining
the threshold in terms of the scores returned by a similarity function, the
user specifies the precision that is expected from the matching process.
Precision is a well known quality measure and has a clear interpretation from
the user's point of view. Our approach relies on mapping between similarity
scores and precision values based on a training data set. Experimental results
show the training may be executed against a representative data set, and reused
for other databases from the same domain.
Keywords: data cleaning - data integration - similarity querying
· Heuser, Carlos Alberto; Krieser, F. A.; Orengo, Viviane Moreira. SimEval - A Tool for Evaluating the Quality of Similarity Functions. In: Tutorials, posters, panels and industrial contributions at the 26th International Conference on Conceptual Modeling - ER 2007. Auckland: ACS, 2007. v. 83. p. 71-76. Link to the article
Abstract:
Approximate data matching applications typically use
similarity functions to quantify the degree of likeness between two data
instances. Several similarity functions are available, so it is often
necessary to evaluate a number of them in order to choose the function that is
most adequate for a specific application. This paper presents a tool that uses
average precision and discernability to evaluate the quality of similarity
functions over a data set.
Keywords: approximate data matching - similarity functions
· Orengo, V.M., L. Buriol, and A. Coelho. A study on the use of Stemming for Monolingual Ad-Hoc Portuguese Information Retrieval, in Evaluation of Multilingual and Multi-modal Information Retrieval, C. Peters, et al., Editors. 2007, Springer
Abstract:
For UFRGS’s first participation in CLEF, our goal was
to compare the performance of heavier and lighter stemming strategies using the
Portuguese data collections for Monolingual Ad-hoc retrieval. The results show
that the safest strategy was to use the lighter alternative (reducing plural
forms only). On a query-by-query analysis, full stemming achieved the highest
improvement but also the biggest decrease in performance when compared to no
stemming. In addition, statistical tests showed that the only significant
improvement both in terms of mean average precision and precision at ten was
achieved by our lighter stemmer.
Keywords: Information retrieval - stemming algorithms - evaluation
· Orengo, V.M. and C.R. Huyck, Portuguese-English Cross-Language Information Retrieval Using Latent Semantic Indexing, in Advances in Cross-Language Information Retrieval - Third Workshop of the Cross-Language Evaluation Forum, CLEF 2002 (LNCS 2785), C. Peters, et al., Editors. 2003, Springer:
Abstract:
This paper reports the work of
Keywords: cross-language information retrieval, stemming algorithms, latent semantic indexing
· Orengo, V.M. and C.R. Huyck, A Stemming Algorithm for the Portuguese Language, in 8th International Symposium on String Processing and Information Retrieval (SPIRE). 2001:
Abstract:
Stemming algorithms are traditionally used in
Information Retrieval with the goal of enhancing recall, as they conflate the
variant forms of a word into a common representation. This paper describes the
development of a simple and effective suffix-stripping algorithm for
Portuguese. The stemmer is evaluated using a method proposed by Paice [9]. The
results show that it performs significantly better than the Portuguese version
of the Porter algorithm.
Keywords: stemming algorithms, evaluation
· Orengo, V.M., Cross-Language Information Retrieval and Digital Libraries, in C@MDX, C. Nielsen and V.M. Orengo, Editors. 2000,
· Moreira, V.P. and N. Edelweiss. Schema Versioning: Queries to the Generalised Temporal Database System. In: INTERNATIONAL WORKSHOP ON SPATIO-TEMPORAL DATA MODELS AND LANGUAGES. 1999. Florence, Italy: IEEE.
· Moreira, V.P. and N. Edelweiss, Queries to Temporal Databases Supporting Schema Versioning, in SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD'99). 1999: Florianópolis. p. 299-313
· Moreira, V.P. and N. Edelweiss, Versioning and The Generalised Temporal Database, in XXV CONFERENCIA LATINO-AMERICANA DE INFORMATICA - CLEI. 1999: Asunción, Paraguay. p. 111-122
· Orengo, V.M., Assessing Relevance Using Automatically Translated Documents for Cross-Language Information Retrieval, in School of Computing Science. 2004,
· Moreira, V.P., Evolução de Esquemas em Bancos de Dados Temporais. 1997, UFRGS: Porto Alegre (Brazil). Individual Study (in Portuguese)
· Moreira, V.P., Consultas a Bancos de Dados Temporais que suportam Versionamento de Esquemas, in Instituto de Informática. 1999, UFRGS: Porto Alegre (Brazil). Master's dissertation (in Portuguese)
· Moreira, V.P. and N. Edelweiss, Consultas a bancos de dados temporais que suportam versionamento de esquemas, in SEMANA ACADEMICA DO CPGCC. 1998: Porto Alegre, RS, Brazil (in Portuguese)