Commit 41b81dc (verified), committed by SherifAnar
1 Parent(s): 2589002

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +26 -2
  2. tokenizer_config.json +2 -1
README.md CHANGED
@@ -4,8 +4,32 @@ license: mit
  tags:
  - tokenizer
  - russian
+ - bpe
  ---

- # russian-bpe-16k
+ # Russian BPE Tokenizer 16000

- Russian tokenizer
+ ## 🗃️ Corpus
+ 50k+ words from ria.ru, lenta.ru, and other sources (2020–2025)
+
+ ## ⚙️ Parameters
+ - Algorithm: BPE
+ - Vocabulary size: 16,000
+ - Min frequency: 2
+
+ ## 📊 Metrics
+ - OOV rate: 1.2%
+ - Reconstruction accuracy: 99.8%
+ - Compression ratio: 1.35
+
+ ## 💻 Usage example
+ ```python
+ from transformers import AutoTokenizer
+
+ # Load the tokenizer from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained("SherifAnar/russian-bpe-16k")
+
+ text = "Привет, как дела?"
+ tokens = tokenizer.tokenize(text)
+ print(tokens)  # illustrative output: ['При', 'вет', ',', 'как', 'дела', '?']
+ ```
+
+ ## 📜 License
+ MIT
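The README lists an OOV rate, a reconstruction accuracy, and a compression ratio without defining them. The sketch below shows one way such numbers could be computed. It rests on assumptions that are not in the commit: OOV rate is taken as the share of tokens equal to `<unk>`, reconstruction accuracy as the share of texts that survive a tokenize/detokenize round trip unchanged, and compression ratio as the average number of tokens per whitespace-separated word. The names `tokenizer_metrics`, `toy_tokenize`, and `toy_detokenize` are hypothetical, and the toy character-level tokenizer only stands in for the real one so that the example runs offline.

```python
from typing import Callable, Dict, Iterable, List

UNK = "<unk>"  # unknown-token string, as declared in tokenizer_config.json


def tokenizer_metrics(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]],
    detokenize: Callable[[List[str]], str],
) -> Dict[str, float]:
    """Compute OOV rate, round-trip accuracy, and tokens per word (assumed definitions)."""
    n_texts = n_exact = n_tokens = n_unk = n_words = 0
    for text in texts:
        tokens = tokenize(text)
        n_texts += 1
        n_tokens += len(tokens)
        n_unk += sum(tok == UNK for tok in tokens)
        n_words += len(text.split())
        n_exact += int(detokenize(tokens) == text)
    return {
        "oov_rate": n_unk / max(n_tokens, 1),
        "reconstruction_accuracy": n_exact / max(n_texts, 1),
        "tokens_per_word": n_tokens / max(n_words, 1),
    }


# Hypothetical stand-in tokenizer: one token per character, with any
# character outside a small lowercase alphabet mapped to <unk>.
KNOWN = set("абвгдежзийклмнопрстуфхцчшщъыьэюя ,?")


def toy_tokenize(text: str) -> List[str]:
    return [ch if ch in KNOWN else UNK for ch in text]


def toy_detokenize(tokens: List[str]) -> str:
    return "".join(tokens)


metrics = tokenizer_metrics(["как дела?", "Привет"], toy_tokenize, toy_detokenize)
print(metrics)
```

With the published tokenizer, `tokenizer.tokenize` and `tokenizer.convert_tokens_to_string` could presumably be passed in place of the toy functions. Whether that reproduces the reported figures depends on the evaluation corpus and on the definitions the author actually used.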
tokenizer_config.json CHANGED
@@ -2,5 +2,6 @@
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "bos_token": "<s>",
- "eos_token": "</s>"
+ "eos_token": "</s>",
+ "tokenizer_class": "PreTrainedTokenizerFast"
  }
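The change appends a `tokenizer_class` entry, which `AutoTokenizer` uses to choose the class to instantiate. Because a new key follows `eos_token`, that line now needs the trailing comma shown in the diff. A minimal sanity check with the standard library is sketched below. The opening brace of the file is an assumption, since the hunk starts at line 2, and `check_tokenizer_config` and `REQUIRED_KEYS` are hypothetical names.

```python
import json

# New tokenizer_config.json as implied by the diff. The opening brace is
# assumed, because the hunk only shows the file from line 2 onward.
CONFIG_TEXT = """{
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "bos_token": "<s>",
  "eos_token": "</s>",
  "tokenizer_class": "PreTrainedTokenizerFast"
}"""

REQUIRED_KEYS = ("unk_token", "pad_token", "bos_token", "eos_token", "tokenizer_class")


def check_tokenizer_config(text: str) -> dict:
    """Parse the config and raise if any expected key is missing."""
    # json.loads raises json.JSONDecodeError on malformed input,
    # for example when the comma after "eos_token" is missing.
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"missing keys: {missing}")
    return config


config = check_tokenizer_config(CONFIG_TEXT)
print(config["tokenizer_class"])
```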