Upload README.md with huggingface_hub
README.md
CHANGED
@@ -3,6 +3,7 @@
 datasets:
 - bigscience/xP3
 - mc4
+- Leipzig_corpora_collection
 license: apache-2.0
 language:
 - en
@@ -24,10 +25,29 @@ The vocabulary has been adjusted to contain the top 2000 tokens overall as well
 
 The new vocabulary has been computed using
 - deu_news_2023_1M for de
-- eng_news_2023_1M for en
+- eng_news_2023_1M for en
+
+from the [Leipzig language corpora collection](https://wortschatz.uni-leipzig.de/en/download/).
+
+According to the mentioned article the performance degradation can be expected to be small (but no evaluation was done).
+
+
+
+# Language Adaption of mT0
+This model is an adaption of the mT0 variant (see below) for languages en, de based on the methodology in [Load What You Need: Smaller Versions of Multilingual BERT](https://arxiv.org/ftp/arxiv/papers/2010/2010.05609.pdf).
+
+The vocabulary has been adjusted to contain the top 2000 tokens overall as well as
+- the top 15000 tokens from en
+- the top 30000 tokens from de
+and 100 special tokens.
+
+The new vocabulary has been computed using
+- deu_news_2023_1M for de
+- eng_news_2023_1M for en
+
 from the [Leipzig language corpora collection](https://wortschatz.uni-leipzig.de/en/download/).
 
-According to the mentioned article the performance degradation can be expected to be small (but no evaluation was done.
+According to the mentioned article the performance degradation can be expected to be small (but no evaluation was done).
 
 
 
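The README text in this diff describes the reduced vocabulary as the union of the top 2000 tokens overall, the top 15000 tokens from en, the top 30000 tokens from de, and 100 special tokens. The following is a minimal sketch of that selection step only, not the author's actual script. The function names `top_ids` and `select_vocabulary` are hypothetical, the token counts are toy values, and it assumes that "overall" means frequencies summed across the per-language corpora. Tokenizing the Leipzig corpora and shrinking the model's embedding matrix are not shown.

```python
from collections import Counter


def top_ids(counts, k):
    """Return the k most frequent token ids from a Counter of token-id frequencies."""
    return [token_id for token_id, _ in counts.most_common(k)]


def select_vocabulary(counts_by_lang, per_lang_k, overall_k, special_ids):
    """Hypothetical helper: union of special tokens, overall top-k, and per-language top-k.

    Assumption: "overall" frequency is the sum of the per-language counts.
    """
    overall = Counter()
    for counts in counts_by_lang.values():
        overall.update(counts)

    keep = set(special_ids)
    keep.update(top_ids(overall, overall_k))
    for lang, counts in counts_by_lang.items():
        keep.update(top_ids(counts, per_lang_k[lang]))
    # Sort so the kept ids can be mapped to new contiguous ids in a stable order.
    return sorted(keep)


# Toy example with made-up token ids and counts (not real corpus statistics).
toy_counts = {
    "en": Counter({10: 6, 11: 3, 12: 1}),
    "de": Counter({20: 4, 11: 2, 21: 1}),
}
kept = select_vocabulary(
    toy_counts,
    per_lang_k={"en": 1, "de": 2},
    overall_k=1,
    special_ids=[0, 1],
)
print(kept)
```

Because the three top-k sets can overlap, the resulting vocabulary may be smaller than the sum of the individual limits; with the numbers from the README that sum is 2000 + 15000 + 30000 + 100 entries at most.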