Update README.md
README.md CHANGED
@@ -47,7 +47,7 @@ tags:
 license: apache-2.0
 ---
 
-# OpenEuroLLM Tokenizer (
+# OpenEuroLLM Tokenizer (256k)
 
 A **262,144-token SentencePiece BPE tokenizer** designed for efficient tokenization across all EU official languages and additional European languages. Trained on 173 GB of curated multilingual text from the OpenEuroLLM data catalogue on LUMI HPC.
 
@@ -63,7 +63,7 @@ A **262,144-token SentencePiece BPE tokenizer** designed for efficient tokenizat
 ```python
 from transformers import AutoTokenizer
 
-tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-
+tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-256k")
 
 text = "Hello world! Bonjour le monde. Hej världen!"
 ids = tok(text)["input_ids"]