## Training Data
To train language-specific base LLMs, we followed the methodology proposed by [Etxaniz et al. (2024)](https://aclanthology.org/2024.acl-long.799/), originally developed for Basque, and extended it to other low-resource languages. To enable fair comparisons across languages, we limited the corpus size for each language to roughly the same number of tokens. We also included a small English subset to mitigate catastrophic forgetting.
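The equal-budget idea above can be sketched in a few lines. This is an illustrative sketch, not the actual data pipeline: the token budget, English fraction, and whitespace "tokenizer" below are all placeholder assumptions.

```python
# Sketch (assumptions, not the authors' pipeline): cap each language's
# corpus at a shared token budget, then add a small English subset to
# mitigate catastrophic forgetting. Whitespace splitting stands in for
# a real tokenizer.

TOKEN_BUDGET = 1_000       # shared per-language cap (placeholder)
ENGLISH_FRACTION = 0.05    # small English share (placeholder)

def count_tokens(doc: str) -> int:
    return len(doc.split())

def cap_corpus(docs, budget):
    """Keep documents in order until the token budget is exhausted."""
    kept, total = [], 0
    for doc in docs:
        n = count_tokens(doc)
        if total + n > budget:
            break
        kept.append(doc)
        total += n
    return kept, total

def build_mixture(corpora_by_lang, english_docs):
    """Cap every language to the same budget; append an English subset."""
    mixture = {}
    for lang, docs in corpora_by_lang.items():
        mixture[lang], _ = cap_corpus(docs, TOKEN_BUDGET)
    english_budget = int(TOKEN_BUDGET * ENGLISH_FRACTION)
    mixture["en"], _ = cap_corpus(english_docs, english_budget)
    return mixture
```

Capping every language to the same budget keeps downstream comparisons fair: differences in model quality then reflect the language and its data quality rather than raw corpus size.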
### Corpus composition