Difference between revisions of "BrWaC"
Revision as of 16:02, 25 March 2017
The Brazilian Portuguese Web as Corpus (BrWaC) is a large corpus constructed in our lab following the WaCky framework and made publicly available for research purposes.
Current version
The current corpus version, released in January 2017, comprises 3.53 million documents, 2.68 billion tokens and 5.79 million types (TTR 0.0021), and was fully parsed with the Palavras parser. It is openly available on our server through the NoSketch Engine platform (without syntactic annotation), and can also be downloaded in full in several formats. If you have any problems accessing it, please contact us directly.
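As a quick check on the figures above, the type-token ratio (TTR) is the number of types divided by the number of tokens. The sketch below recomputes it from the rounded counts quoted in this section. The function and constant names are illustrative only, and the exact counts are not given here, so the result is approximate.

```python
def type_token_ratio(num_types: int, num_tokens: int) -> float:
    """Return the type/token ratio: distinct word forms divided by running words."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return num_types / num_tokens


# Rounded counts as reported in this section (assumption: exact counts differ slightly).
BRWAC_TYPES = 5_790_000
BRWAC_TOKENS = 2_680_000_000

ttr = type_token_ratio(BRWAC_TYPES, BRWAC_TOKENS)
print(f"TTR = {ttr:.5f}")
```

With these rounded inputs the ratio comes out at about 0.00216, which agrees with the reported 0.0021 when truncated to four decimal places.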
Classification according to readability levels
During our studies, the whole corpus was annotated with 134 different textual attributes and classified according to readability level by four learning models with different characteristics. For more information about these experiments, please see our publications or contact us.
Publications
- Jorge Alberto Wagner Filho, Rodrigo Wilkens and Aline Villavicencio. Automatic Construction of Large Readability Corpora. In 1st Workshop on Computational Linguistics for Linguistic Complexity (CL4LC). (PDF: http://www.inf.ufrgs.br/pln/wiki/images/9/97/ACoLRCinProc.pdf). 🇬🇧
- Jorge A. Wagner Filho, Rodrigo Wilkens, Leonardo Zilio, Marco Idiart, and Aline Villavicencio. Crawling by readability level. In 12th International Conference on Computational Processing of the Portuguese Language (pp. 306-318). Springer International Publishing. (PDF: http://www.inf.ufrgs.br/pln/wiki/images/2/26/CRL.pdf). 🇬🇧
- Jorge Alberto Wagner Filho. Coleta automática de corpora Web classificados por grau de legibilidade para o português. BSc. Final Project, UFRGS Digital Repository. (Lume: http://www.lume.ufrgs.br/handle/10183/147619). 🇧🇷