Viviane P. Moreira’s Research Page

 

Research Interests

My research interests combine Data Mining, Information Retrieval, Natural Language Processing, Databases, and the integration of these areas. More specifically, the topics I am currently working on are:

·         Data Mining of Electronic Health Records

·         Multimodal Information Retrieval

·         Multilingual Matching

·         Plagiarism Detection

·         Text Summarization

·         Opinion Mining

 

Previous Work:

Within Information Retrieval, my work was mainly on Cross-Language Information Retrieval, which is the retrieval of documents in one natural language, based on a query formulated in another natural language, e.g. retrieval of English documents based on a Spanish query.

I am also interested in Stemming Algorithms for Portuguese. As part of my PhD, I developed the RSLP Stemmer, which is freely accessible.

In Databases, my work was mainly with Similarity Functions and how to evaluate their performance.

For papers on those subjects, please refer to my publications.

 

Research Projects (needs updating)

2012 - present

Multi-Match – multilingual matching
Multilingual information is available in many different sources and formats. This fact has motivated research aimed at finding mappings between data represented in different languages in the areas of Information Retrieval, Natural Language Processing, and, more recently, Databases. This project focuses on researching and proposing new methods for matching multilingual data in different scenarios. Our goals are: (i) collecting parallel corpora on the web, (ii) finding correspondences in multilingual Wikipedia, and (iii) detecting multilingual plagiarism. The results of this project will contribute to the fields of Information Retrieval and Natural Language Processing by providing parallel corpora, which are very important resources for the advancement of these areas. The wide availability of multilingual data also facilitates plagiarism, so multilingual plagiarism detection will also be targeted in this project. In this topic, our main contribution will be the inclusion of an analysis of citations and references, which is essential for the confirmation of plagiarism.
Funded by CNPq (Edital Universal)

2010 - present

Cameleon
The goal of this project is to investigate, propose, experiment, apply and validate automatic and collaborative techniques for the development of lexical and ontological resources that can be useful in the context of multilingual applications, particularly for French, Portuguese and English.
Project HomePage: http://cameleon.imag.fr
Funded by CAPES
Coordinated by Aline Villavicencio

2010 - 2012

DP-ML Cross-Language Plagiarism Detection
With the dissemination of the Web, millions of people have gained access to information from several areas of knowledge. The amount of digital information available has grown immensely. However, despite its many benefits, the Web is one of the easiest means to enable plagiarism. Plagiarism is one of the most serious forms of academic misconduct. It is defined as “the use of another person's written work without acknowledging the source”. Recent research has shown that this type of misconduct is increasingly frequent in the academic world. This fact has motivated the development of techniques to automate plagiarism detection. This project focuses on cross-language plagiarism detection, in which the contents of a document are translated without making any reference to its source. Our aim is to develop an efficient method to enable the detection of cross-language plagiarism.
Funded by CNPq (Edital Universal)

2009 -

INCT Web
INWeb was created to study the various phenomena related to the Web. It is an institute composed of a network of researchers from four Brazilian universities. The mission of this institute is to develop models, algorithms and technologies that contribute to the integration of the Web with society. As a result, we expect more effective and secure distribution of information and more efficient and useful applications, so that the Web can become a vector for social and economic change in our country. The institute's activities include research, education of human resources and knowledge transfer to society and companies. Our research proposal plans to improve the state of the art for the three layers of networks. Specifically, we aim to develop solutions for three great challenges defined on the unified view of the Web: (i) identification, characterization and modeling of people's interests and behavior patterns on the Web and the networks established among them; (ii) treatment of information that circulates through the Web layers, considering the activities of crawling, extracting and processing information; (iii) delivery of information in a satisfying way regardless of time and place.
Coordinated by Virgílio Almeida (UFMG).
Website: http://www.inweb.org.br/
Funded by: CNPq, MCT, and Fapemig

2008 - 2010

Cross-Language Information Retrieval
The aim of this project is to contribute to the development of Cross-Language Information Retrieval (CLIR) involving Portuguese. The motivation is the growing need, experienced by more and more people, to explore documents in foreign languages. CLIR research has been developing quickly since the late 1990s. Despite recent advances, there are still many aspects left unexplored, especially regarding the use of Portuguese. In this project we hope to develop a system that accepts queries in Portuguese and searches for documents in English. In addition, the following aspects will be investigated: (i) development of stemming algorithms for Portuguese; (ii) proposal of new techniques for mapping concepts between languages through the analysis of parallel and comparable corpora; (iii) study of the process of relevance feedback in a CLIR environment; and (iv) development of techniques for the identification of multi-word expressions.
Funded by CNPq (Edital Universal)

2008 - Present

GPU Cluster

The aim of this project is to build a computer cluster based on Graphics Processing Units (GPUs) at the Institute of Informatics of UFRGS. The cluster consists of 6 computers with quad-core processors (4 CPUs), each connected via PCI-X to an external unit containing 4 GPUs. Thus, the cluster will have 24 CPUs and 24 GPUs connected by high-speed Infiniband switches. Given that each CPU is composed internally of 4 processors, there will effectively be 3072 internal processors with a computational power of approximately 12 TFLOPS. The computational resources provided by this cluster will allow the processing of computationally complex tasks and will be vital for the research to be developed at the university in the next few years.

Funded by CNPq (Edital Jovens Pesquisadores)

Coordinated by João Luiz Dihl Comba

2007 - Present

ApproxMatch
Approximate Data Matching aims at deciding whether two data instances represent the same real-world entity. This technique is employed in many data management applications, such as record deduplication, similarity querying, similarity joining and schema integration. This project aims at addressing three open problems in Approximate Data Matching: (i) defining adequate similarity functions for complex objects such as XML trees; (ii) developing quantitative measures to compare the quality of similarity functions; (iii) studying how query decomposition methods should behave in environments where schema matching happens at query time.
Funded by CNPq (Edital Universal)
Coordinated by Carlos A. Heuser

2007 - Present

Managing Large Volumes of Textual Data
This project is within the scope of the challenges set by the Brazilian Computer Society, namely the management of large volumes of multimedia distributed data. In the context of this challenge, this project deals specifically with the management of textual data, such as web pages or electronic documents, created by public or private organisations. One of the central problems is to establish relations and associations between documents. In this project, two types of relationships are considered: (i) versioning of documents, aiming at determining groups of documents that can be considered different versions of the same information; and (ii) content similarity, aiming at clustering documents that deal with the same subject.
Funded by CNPq (Edital Grandes Desafios)
Coordinated by J. Palazzo M. de Oliveira

2005 - Present

Integrating Information Retrieval Techniques into Database Systems
In the classical view, the areas of Information Retrieval (IR) and Databases (DB) have little in common. DB normally deals with structured data while IR deals with unstructured documents, typically in the form of free text. Considering that data stored by most organisations is both structured and unstructured, and that users frequently have to query data in both formats, there is a growing need to integrate the two areas. The aim of this project is to apply IR concepts to DB systems to facilitate the process of solving imprecise queries. The challenge is to modify the processing of queries to include the notions of similarity and relevance.
Funded by CAPES-PRODOC 

2003 - 2003

Cell Assemblies
Reverberating circuits of neurons can explain many psychological phenomena; as the neural representation of concepts, they may be the basis of thought. While evidence exists for neural Cell Assemblies (CAs), there has been very little work on the computational modelling of CAs. The goal of this project is to explore models of CAs, metrics and uses of CAs. My contribution was to adapt the CAs model to perform Information Retrieval.
Funded by EPSRC, England
Coordinated by Christian Huyck