Difference between revisions of "BrWaC"

Revision as of 11:18, 18 February 2017

The Brazilian Portuguese Web as Corpus (BrWaC) is a large corpus constructed in our lab following the WaCky framework and made publicly available for research purposes.

Current version

The current corpus version, released in January 2017, is composed of 3.53 million documents, 2.68 billion tokens and 5.79 million types (type-token ratio, TTR, of 0.0021), and was completely parsed with the Palavras parser. It is available through the NoSketch Engine platform on our server, accessible at http://nlpserver2.inf.ufrgs.br/brwac/ . For other forms of access, please contact us directly.
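
The TTR figure is the number of types divided by the number of tokens. As a quick sanity check, here is a minimal Python sketch using the rounded counts quoted on this page. The function and constant names are illustrative assumptions, not part of any BrWaC tooling.

```python
def type_token_ratio(num_types: float, num_tokens: float) -> float:
    """Return types / tokens; raise ValueError for a non-positive token count."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return num_types / num_tokens


# Rounded figures reported for the January 2017 release (assumed inputs).
BRWAC_TYPES = 5.79e6   # 5.79 million types
BRWAC_TOKENS = 2.68e9  # 2.68 billion tokens

ttr = type_token_ratio(BRWAC_TYPES, BRWAC_TOKENS)
print(f"TTR = {ttr:.5f}")
```

With these rounded inputs the ratio comes out at roughly 0.00216, in line with the 0.0021 reported above. The small difference in the last digit presumably reflects truncation or the use of the exact, unrounded counts.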


Classification according to readability levels

During our studies, the whole corpus was annotated with 134 textual attributes and classified by reading level using four learning models with different characteristics. For more information about these experiments, please check our publications or contact us.


Publications

  • Jorge Alberto Wagner Filho, Rodrigo Wilkens and Aline Villavicencio. Automatic Construction of Large Readability Corpora. In 1st Workshop on Computational Linguistics for Linguistic Complexity (CL4LC). (PDF: http://www.inf.ufrgs.br/pln/wiki/images/9/97/ACoLRCinProc.pdf).
  • Jorge A. Wagner Filho, Rodrigo Wilkens, Leonardo Zilio, Marco Idiart and Aline Villavicencio. Crawling by readability level. In 12th International Conference on Computational Processing of the Portuguese Language (pp. 306-318). Springer International Publishing. (Springer: http://link.springer.com/chapter/10.1007/978-3-319-41552-9_31).