Luciano Barbosa, a researcher at IBM, will give a talk entitled “Crawling Parallel Corpora on the Web” on Tuesday, October 8, at 15:30, in the Auditório Castilho of the Instituto de Informática.
Title: Crawling Parallel Corpora on the Web
Date: October 8 (Tuesday)
Time: 15:30
Location: Auditório Castilho (green)
Abstract:
Parallel texts are translations of the same text in different languages. Parallel text acquisition from the Web has received increased attention in recent years, especially for machine translation and cross-language information retrieval. For many years, the European Parliament proceedings and official documents of countries with multiple languages were the only widely available parallel texts. Although these are high-quality corpora, they have some limitations: (1) they tend to be domain specific (e.g., government-related texts); (2) they are available in only a few languages; and (3) sometimes they are not free or there is some restriction on using them. On the other hand, Web data is free and comprises data from different languages and topics. In this talk, I will present a web crawler specialized in collecting parallel corpora from the Web. First, it locates sites that contain multilingual content (multilingual sites), by constraining its visitation policy to the graph neighborhood of these sites. Subsequently, it uses a novel recursive mining technique to extract parallel texts within these sites. A sub-task in the problem of multilingual site discovery is the job of detecting multilingual sites. I will then introduce the MultiSite Detector that performs this task with limited supervision, high precision and low cost.
Bio:
Luciano Barbosa is a Research Scientist at IBM Research – Brazil. Previously (2009-2013), he worked as a Research Scientist at AT&T Labs – Research. He obtained his Ph.D. in Computing from the University of Utah (2009), and his M.S. (2003) and B.S. (2000) in Computer Science from Universidade Federal de Pernambuco. From 1999 to 2002, he worked as a lead developer at RADIX, one of the first Brazilian search engines. His research interests include web mining, text mining, and NLP.