PLN UFRGS

Latest revision as of 23:36, 8 June 2020

The Brazilian Portuguese Web as Corpus (brWaC) is a large corpus constructed in our lab following the WaCky framework and made public for research purposes.

Current version

The current corpus version, released in January 2017, comprises 3.53 million documents, 2.68 billion tokens and 5.79 million types (TTR 0.0021), and was fully parsed with the Palavras parser. It is openly available, both for download and for browsing in a NoSketch Engine interface; you can request access through this form: https://goo.gl/forms/jrAe1HqcLJ39Zrou2. If you have any problems, please contact us directly at brwacteam at Gmail.
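The type-token ratio (TTR) quoted above is the number of types divided by the number of tokens. As a minimal sketch, the figure can be reproduced from the rounded counts given on this page; the function and constant names are illustrative assumptions, and the exact counts are not stated here.

```python
def type_token_ratio(num_types: int, num_tokens: int) -> float:
    """Return the type-token ratio: distinct word forms divided by running words."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return num_types / num_tokens


# Rounded counts quoted on this page (assumption: the exact counts differ slightly).
BRWAC_TYPES = 5_790_000
BRWAC_TOKENS = 2_680_000_000

ttr = type_token_ratio(BRWAC_TYPES, BRWAC_TOKENS)
print(f"TTR = {ttr:.5f}")
```

With these rounded inputs the ratio comes out at about 0.00216, which matches the reported 0.0021 when truncated to four decimal places.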

Classification according to readability levels

During our studies, the whole corpus was annotated with 134 different textual attributes and classified according to readability levels by four different learning models with different characteristics. These annotations are also openly available through the brWaC access form above.

Publications

Main reference: Jorge Alberto Wagner Filho, Rodrigo Wilkens, Marco Idiart and Aline Villavicencio. The brWaC Corpus: A New Open Resource for Brazilian Portuguese. In 11th Language Resources and Evaluation Conference (LREC). (ResearchGate: https://www.researchgate.net/publication/326303825_The_brWaC_Corpus_A_New_Open_Resource_for_Brazilian_Portuguese). 🇬🇧

Main reference for the readability annotations: Jorge Alberto Wagner Filho, Rodrigo Wilkens and Aline Villavicencio. Automatic Construction of Large Readability Corpora. In 1st Workshop on Computational Linguistics for Linguistic Complexity (CL4LC). (PDF: http://www.inf.ufrgs.br/pln/wiki/images/9/97/ACoLRCinProc.pdf). 🇬🇧

Jorge A. Wagner Filho, Rodrigo Wilkens, Leonardo Zilio, Marco Idiart, and Aline Villavicencio. Crawling by readability level. In 12th International Conference on Computational Processing of the Portuguese Language (pp. 306-318). Springer International Publishing. (PDF: http://www.inf.ufrgs.br/pln/wiki/images/2/26/CRL.pdf). 🇬🇧

Jorge Alberto Wagner Filho. Coleta automática de corpora Web classificados por grau de legibilidade para o português. BSc. Final Project, UFRGS Digital Repository. (Lume: http://www.lume.ufrgs.br/handle/10183/147619). 🇧🇷

Selected works employing brWaC

If you have used our corpus in an interesting application, please cite our main reference in your publication or send us a message so that we can feature it here.

Zilio, L., Wilkens, R., & Fairon, C. (2018, September). PassPort: A Dependency Parsing Model for Portuguese. In International Conference on Computational Processing of the Portuguese Language (pp. 479-489). Springer, Cham.

Santos, J., Consoli, B., dos Santos, C., Terra, J., Collonini, S., & Vieira, R. (2019, October). Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition. In 2019 8th Brazilian Conference on Intelligent Systems (BRACIS) (pp. 437-442). IEEE.

Frazão, P. (2020, May). Idioma: A Document-Based Language-Learning Platform. B.A. Thesis, Princeton University. (GitHub: https://github.com/paulo892/IdiomaFinal).

Santos, J., Consoli, B., & Vieira, R. (2020, May). Word Embedding Evaluation in Downstream Tasks and Semantic Analogies. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 4828-4834).

