Update README.md
README.md
CHANGED
@@ -32,6 +32,9 @@ widget:
 - [Training data](#training-data)
 - [Training procedure](#training-procedure)
 - [Evaluation](#evaluation)
+- [Variable and metrics](#variable-and-metrics)
+- [Evaluation benchmark](#evaluation-benchmark)
+- [Evaluation results](#evaluation-results)
 - [Additional information](#additional-information)
 - [Authors](#authors)
 - [Contact information](#contact-information)
@@ -56,7 +59,7 @@ This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2)
 
 The model has 6 layers, a hidden size of 768 and 12 attention heads, totaling 82M parameters (compared to 125M parameters for RoBERTa-base). On average, it is twice as fast as its teacher.
 
-We encourage users of this model to check out the [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) model card to learn more details about the training and evaluation data.
+We encourage users of this model to check out the [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) model card to learn more details about the teacher model, as well as the training and evaluation data.
 
 ## Intended uses and limitations
 
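For quick orientation, a minimal usage sketch follows. The checkpoint id `projecte-aina/distilroberta-base-ca-v2` is an assumption, not stated in the card; substitute this model's actual repo id. The printed parameter count should be close to the 82M figure quoted above.

```python
# Minimal sketch, assuming the checkpoint id below; swap in this model's actual repo id.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "projecte-aina/distilroberta-base-ca-v2"  # assumed id, not stated in the card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Sanity-check the ~82M parameter figure from the description.
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")

# RoBERTa-style checkpoints use <mask> as the mask token.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("El català és una llengua <mask>.")[0])
```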
@@ -81,7 +84,8 @@ At the time of submission, no measures have been taken to estimate the bias embedded in the model.
 ### Training data
 
 The training corpus consists of several corpora gathered from web crawling and public corpora.
-
+<details>
+<summary>Click to expand</summary>
 | Corpus                  | Size in GB |
 |-------------------------|------------|
 | Catalan Crawling        | 13.00      |
@@ -98,6 +102,7 @@ The training corpus consists of several corpora gathered from web crawling and public corpora.
 | Nació Digital           | 0.42       |
 | Vilaweb                 | 0.06       |
 | Tweets                  | 0.02       |
+</details>
 
 ### Training procedure
 
@@ -115,13 +120,30 @@ As an example, the distilled version of BERT has 40% fewer parameters and runs 60% faster
 
 [TODO]
 
+### Evaluation benchmark
+
+This model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB).
+
+Here are the train/dev/test splits of each dataset:
+
+| Dataset   | Task | Total   | Train   | Dev    | Test   |
+|:----------|:-----|:--------|:--------|:-------|:-------|
+| Ancora    | NER  | 13,581  | 10,628  | 1,427  | 1,526  |
+| Ancora    | POS  | 16,678  | 13,123  | 1,709  | 1,846  |
+| STS-ca    | STS  | 3,073   | 2,073   | 500    | 500    |
+| TeCla     | TC   | 137,775 | 110,203 | 13,786 | 13,786 |
+| TE-ca     | TE   | 21,163  | 16,930  | 2,116  | 2,117  |
+| VilaQuAD  | QA   | 6,282   | 3,882   | 1,200  | 1,200  |
+| ViquiQuAD | QA   | 14,239  | 11,255  | 1,492  | 1,429  |
+| CatalanQA | QA   | 21,427  | 17,135  | 2,157  | 2,135  |
+
 ### Evaluation results
 
-This
+This is how it compares to the teacher model when fine-tuned on the same downstream tasks:
 
 | Task                     | NER (F1) | POS (F1) | STS-ca (Comb) | TeCla (Acc.) | TEca (Acc.) | VilaQuAD (F1/EM) | ViquiQuAD (F1/EM) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
 |--------------------------|:--------:|:--------:|:-------------:|:------------:|:-----------:|:----------------:|:-----------------:|:-----------------:|:-----------------------------:|
-| RoBERTa-large-ca-v2
+| RoBERTa-large-ca-v2      | 89.82    | 99.02    | 83.41         | 75.46        | 83.61       | 89.34/75.50      | 89.20/75.77       | 90.72/79.06       | 73.79/55.34                   |
 | RoBERTa-base-ca-v2       | 89.29    | 98.96    | 79.07         | 74.26        | 83.14       | 87.74/72.58      | 88.72/75.91       | 89.50/76.63       | 73.64/55.42                   |
 | DistilRoBERTa-base-ca-v2 | xx.xx    | xx.xx    | xx.xx         | xx.xx        | xx.xx       | xx.xx/xx.xx      | xx.xx/xx.xx       | xx.xx/xx.xx       | xx.xx/xx.xx                   |
 
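The training procedure in this hunk is still marked [TODO]. For orientation only, here is a generic knowledge-distillation objective in the spirit of DistilBERT: a soft-target KL term against the teacher's logits blended with the usual masked-LM loss. This is a sketch of the general technique, not the authors' documented recipe; the temperature `T` and mixing weight `alpha` are illustrative.

```python
# Generic distillation objective (Hinton-style soft targets + hard MLM loss).
# NOT this model's documented procedure; T and alpha are illustrative values.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, mlm_loss, T=2.0, alpha=0.5):
    # Soften both distributions with temperature T and compare them with KL divergence.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft-target gradients match the hard loss in magnitude
    # Blend the soft-target term with the standard masked-language-modelling loss.
    return alpha * soft + (1.0 - alpha) * mlm_loss
```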
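For the QA columns in the results table, F1/EM pairs of this kind are typically computed with the standard `squad` metric from the `evaluate` library, as sketched below; the toy prediction/reference pair is illustrative only.

```python
# How F1/EM pairs like those in the results table are typically computed.
import evaluate

squad = evaluate.load("squad")
preds = [{"id": "q0", "prediction_text": "Barcelona"}]
refs = [{"id": "q0", "answers": {"text": ["Barcelona"], "answer_start": [0]}}]
print(squad.compute(predictions=preds, references=refs))
# {'exact_match': 100.0, 'f1': 100.0}
```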