BSC-LT/roberta-base-ca

@@ -111,75 +111,7 @@ It contains the following tasks and their related datasets:
  3. Text Classification (TC)
-    **[TeCla](---
-language: "ca"
-tags:
-- masked-lm
-- BERTa
-- catalan
-license: apache-2.0
----
-# BERTa: RoBERTa-based Catalan language model
-## BibTeX  citation
-If you use any of these resources (datasets or models) in your work, please cite our latest paper:
-```bibtex
-@inproceedings{armengol-estape-etal-2021-multilingual,
-    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
-    author = "Armengol-Estap{\'e}, Jordi  and
-      Carrino, Casimiro Pio  and
-      Rodriguez-Penagos, Carlos  and
-      de Gibert Bonet, Ona  and
-      Armentano-Oller, Carme  and
-      Gonzalez-Agirre, Aitor  and
-      Melero, Maite  and
-      Villegas, Marta",
-    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
-    month = aug,
-    year = "2021",
-    address = "Online",
-    publisher = "Association for Computational Linguistics",
-    url = "https://aclanthology.org/2021.findings-acl.437",
-    doi = "10.18653/v1/2021.findings-acl.437",
-    pages = "4933--4946",
-}
-```
-## Model description
-BERTa is a transformer-based masked language model for the Catalan language.
-It is based on the [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) base model
-and has been trained on a medium-size corpus collected from publicly available corpora and crawlers.
-## Training corpora and preprocessing
-The training corpus consists of several corpora gathered from web crawling and public corpora.
-The publicly available corpora are:
- 1. the Catalan part of the [DOGC](http://opus.nlpl.eu/DOGC-v2.php) corpus, a set of documents from the Official Gazette of the Catalan Government
- 2. the [Catalan Open Subtitles](http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.ca.gz), a collection of translated movie subtitles
- 3. the non-shuffled version of the Catalan part of the [OSCAR](https://traces1.inria.fr/oscar/) corpus \\\\cite{suarez2019asynchronous},
-    a collection of monolingual corpora, filtered from [Common Crawl](https://commoncrawl.org/about/)
- 4. The [CaWac](http://nlp.ffzg.hr/resources/corpora/cawac/) corpus, a web corpus of Catalan built from the .cat top-level-domain in late 2013
-    the non-deduplicated version
- 5. the [Catalan Wikipedia articles](https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/cawiki/20200801/) downloaded on 18-08-2020.
-The crawled corpora are:
- 6. The Catalan General Crawling, obtained by crawling the 500 most popular .cat and .ad domains
- 7. the Catalan Government Crawling, obtained by crawling the .gencat domain and subdomains, belonging to the Catalan Government
- 8. the ACN corpus with 220k news items from March 2015 until October 2020, crawled from the [Catalan News Agency](https://www.acn.cat/)
-https://doi.org/10.5281/zenodo.4627197)**: consisting of 137k news pieces from the Catalan News Agency ([ACN](https://www.acn.cat/)) corpus
+    **[TeCla](https://doi.org/10.5281/zenodo.4627197)**: consisting of 137k news pieces from the Catalan News Agency ([ACN](https://www.acn.cat/)) corpus
  4. Semantic Textual Similarity (STS)
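
The change above restores the TeCla link, whose target had absorbed the model card's front matter and several sections. A minimal sketch of how such a malformed link could be caught before committing is shown below. The function name `find_suspect_links`, the regular expression, and the shortened `broken` sample are illustrative assumptions, not part of this repository or of any Hugging Face tooling.

```python
import re

# Matches inline markdown links of the form [text](target). The target is
# captured lazily up to the first closing parenthesis, including newlines.
LINK_RE = re.compile(r"\[([^\]]+)\]\((.*?)\)", re.DOTALL)


def find_suspect_links(markdown: str) -> list:
    """Return the link texts whose target is not a single-line http(s) URL."""
    suspects = []
    for text, target in LINK_RE.findall(markdown):
        if "\n" in target or not target.startswith(("http://", "https://")):
            suspects.append(text)
    return suspects


# Hypothetical, shortened stand-in for the broken line removed in this diff.
broken = '**[TeCla](---\nlanguage: "ca"\nhttps://doi.org/10.5281/zenodo.4627197)**'
# The line as restored by this diff.
fixed = "**[TeCla](https://doi.org/10.5281/zenodo.4627197)**"

print(find_suspect_links(broken))
print(find_suspect_links(fixed))
```

This check is deliberately strict: it also flags relative links and anchors, so a card that uses those would need a looser rule.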