bsc-temu committed on
Commit · eb8ceb7
1 Parent(s): 78ad185
remove repeated text readme
README.md
CHANGED
|
@@ -111,75 +111,7 @@ It contains the following tasks and their related datasets:
|
|
| 111 |
|
| 112 |
3. Text Classification (TC)
|
| 113 |
|
| 114 |
-
**[TeCla](
|
| 115 |
-
language: "ca"
|
| 116 |
-
tags:
|
| 117 |
-
- masked-lm
|
| 118 |
-
- BERTa
|
| 119 |
-
- catalan
|
| 120 |
-
license: apache-2.0
|
| 121 |
-
---
|
| 122 |
-
|
| 123 |
-
# BERTa: RoBERTa-based Catalan language model
|
| 124 |
-
|
| 125 |
-
## BibTeX citation
|
| 126 |
-
|
| 127 |
-
If you use any of these resources (datasets or models) in your work, please cite our latest paper:
|
| 128 |
-
|
| 129 |
-
```bibtex
|
| 130 |
-
@inproceedings{armengol-estape-etal-2021-multilingual,
|
| 131 |
-
title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
|
| 132 |
-
author = "Armengol-Estap{\'e}, Jordi and
|
| 133 |
-
Carrino, Casimiro Pio and
|
| 134 |
-
Rodriguez-Penagos, Carlos and
|
| 135 |
-
de Gibert Bonet, Ona and
|
| 136 |
-
Armentano-Oller, Carme and
|
| 137 |
-
Gonzalez-Agirre, Aitor and
|
| 138 |
-
Melero, Maite and
|
| 139 |
-
Villegas, Marta",
|
| 140 |
-
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
|
| 141 |
-
month = aug,
|
| 142 |
-
year = "2021",
|
| 143 |
-
address = "Online",
|
| 144 |
-
publisher = "Association for Computational Linguistics",
|
| 145 |
-
url = "https://aclanthology.org/2021.findings-acl.437",
|
| 146 |
-
doi = "10.18653/v1/2021.findings-acl.437",
|
| 147 |
-
pages = "4933--4946",
|
| 148 |
-
}
|
| 149 |
-
```
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
## Model description
|
| 153 |
-
|
| 154 |
-
BERTa is a transformer-based masked language model for the Catalan language.
|
| 155 |
-
It is based on the [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) base model
|
| 156 |
-
and has been trained on a medium-size corpus collected from publicly available corpora and crawlers.
|
| 157 |
-
|
| 158 |
-
## Training corpora and preprocessing
|
| 159 |
-
|
| 160 |
-
The training corpus consists of several corpora gathered from web crawling and public corpora.
|
| 161 |
-
|
| 162 |
-
The publicly available corpora are:
|
| 163 |
-
|
| 164 |
-
1. the Catalan part of the [DOGC](http://opus.nlpl.eu/DOGC-v2.php) corpus, a set of documents from the Official Gazette of the Catalan Government
|
| 165 |
-
|
| 166 |
-
2. the [Catalan Open Subtitles](http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.ca.gz), a collection of translated movie subtitles
|
| 167 |
-
|
| 168 |
-
3. the non-shuffled version of the Catalan part of the [OSCAR](https://traces1.inria.fr/oscar/) corpus \\\\cite{suarez2019asynchronous},
|
| 169 |
-
a collection of monolingual corpora, filtered from [Common Crawl](https://commoncrawl.org/about/)
|
| 170 |
-
|
| 171 |
-
4. The [CaWac](http://nlp.ffzg.hr/resources/corpora/cawac/) corpus, a web corpus of Catalan built from the .cat top-level-domain in late 2013
|
| 172 |
-
the non-deduplicated version
|
| 173 |
-
|
| 174 |
-
5. the [Catalan Wikipedia articles](https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/cawiki/20200801/) downloaded on 18-08-2020.
|
| 175 |
-
|
| 176 |
-
The crawled corpora are:
|
| 177 |
-
|
| 178 |
-
6. The Catalan General Crawling, obtained by crawling the 500 most popular .cat and .ad domains
|
| 179 |
-
7. the Catalan Government Crawling, obtained by crawling the .gencat domain and subdomains, belonging to the Catalan Government
|
| 180 |
-
|
| 181 |
-
8. the ACN corpus with 220k news items from March 2015 until October 2020, crawled from the [Catalan News Agency](https://www.acn.cat/)
|
| 182 |
-
https://doi.org/10.5281/zenodo.4627197)**: consisting of 137k news pieces from the Catalan News Agency ([ACN](https://www.acn.cat/)) corpus
|
| 183 |
|
| 184 |
4. Semantic Textual Similarity (STS)
|
| 185 |
|
|
|
|
| 111 |
|
| 112 |
3. Text Classification (TC)
|
| 113 |
|
| 114 |
+
**[TeCla](https://doi.org/10.5281/zenodo.4627197)**: consisting of 137k news pieces from the Catalan News Agency ([ACN](https://www.acn.cat/)) corpus
| 115 |
|
| 116 |
4. Semantic Textual Similarity (STS)
|
| 117 |
|