Upload folder using huggingface_hub
- README.md +26 -2
- tokenizer_config.json +2 -1
README.md
CHANGED
@@ -4,8 +4,32 @@ license: mit
 tags:
 - tokenizer
 - russian
+- bpe
 ---

-#
+# Russian BPE Tokenizer 16000

-
+## 🗃️ Corpus
+50k+ words from ria.ru, lenta.ru, and other sources (2020–2025)
+
+## ⚙️ Parameters
+- Algorithm: BPE
+- Vocabulary size: 16,000
+- Min frequency: 2
+
+## 📊 Metrics
+- OOV rate: 1.2%
+- Reconstruction accuracy: 99.8%
+- Compression ratio: 1.35
+
+## 💻 Usage example
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("SherifAnar/russian-bpe-16k")
+
+text = "Привет, как дела?"
+tokens = tokenizer.tokenize(text)
+print(tokens) # ['ри', 'вет', ',', 'как', 'дела', '?']
+📜 License
+MIT
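The README lists a BPE algorithm with a vocabulary size of 16,000 and a minimum frequency of 2. As a rough illustration of how those two settings interact, here is a toy pure-Python BPE trainer. It is a sketch only, not the training script used for this repository: the function name `train_bpe`, the sample word list, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def train_bpe(words, vocab_size, min_frequency):
    """Toy BPE trainer (illustrative only). Returns (vocab, merges)."""
    # Each distinct word becomes a tuple of symbols weighted by its frequency.
    corpus = Counter(tuple(w) for w in words)
    vocab = sorted({ch for symbols in corpus for ch in symbols})
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by pair ordering (an assumption).
        best, count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if count < min_frequency:
            break  # remaining pairs are too rare to merge
        merges.append(best)
        merged = best[0] + best[1]
        if merged not in vocab:
            vocab.append(merged)
        # Rewrite every word, replacing the chosen pair with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab, merges


# Hypothetical mini-corpus; the real corpus is described only in prose above.
words = "привет привет привет приветик как как дела".split()
vocab, merges = train_bpe(words, vocab_size=16000, min_frequency=2)
print(len(vocab), len(merges))
```

On a corpus this small the `min_frequency` threshold stops training long before `vocab_size` is reached: words that occur at least twice collapse into single symbols, while one-off words such as "дела" stay split into characters.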
tokenizer_config.json
CHANGED
@@ -2,5 +2,6 @@
 "unk_token": "<unk>",
 "pad_token": "<pad>",
 "bos_token": "<s>",
-"eos_token": "</s>"
+"eos_token": "</s>",
+"tokenizer_class": "PreTrainedTokenizerFast"
 }
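The `tokenizer_class` entry is what `AutoTokenizer` consults when choosing which tokenizer class to instantiate, and the `eos_token` line changes only because it now needs a trailing comma. A quick sanity check of the resulting file might look like the sketch below. The opening brace and the indentation are assumed, since the hunk starts at line 2, and the helper name `check_tokenizer_config` is hypothetical.

```python
import json

# Post-change file contents as implied by the diff; line 1 ("{") is an assumption.
config_text = """{
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "bos_token": "<s>",
  "eos_token": "</s>",
  "tokenizer_class": "PreTrainedTokenizerFast"
}"""

REQUIRED_KEYS = ("unk_token", "pad_token", "bos_token", "eos_token", "tokenizer_class")


def check_tokenizer_config(text):
    """Parse the config and return the names of any required keys that are missing."""
    config = json.loads(text)  # raises json.JSONDecodeError if a comma is missing
    return [key for key in REQUIRED_KEYS if key not in config]


missing = check_tokenizer_config(config_text)
print(missing)
```

An empty list means the file parses as JSON and carries all five keys shown in the diff.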