gsaltintas committed b95e2a0 (verified) · 1 parent: 4d2a76c

Upload README.md with huggingface_hub

  1. README.md +28 -0
README.md ADDED
@@ -0,0 +1,28 @@
+ # Super Vocabulary
+
+ A merged super-vocabulary built from 11 tokenizers.
+
+ **Vocab size:** 163711
+
+ ## Tokenizers
+
+ - `flexitok/bpe_script_Arab_16000`
+ - `flexitok/bpe_script_CmJp_16000`
+ - `flexitok/bpe_ltr_ell_Grek_8000_v2`
+ - `flexitok/bpe_ltr_fw_edu_32000_v2`
+ - `flexitok/bpe_ltr_hun_Latn_8000_v2`
+ - `flexitok/bpe_ltr_rus_Cyrl_16000_v2`
+ - `flexitok/bpe_ltr_tur_Latn_8000_v2`
+ - `flexitok/bpe_script_Germ_32000`
+ - `flexitok/bpe_script_Roma_32000`
+ - `flexitok/bpe_script_SEAS_16000`
+ - `flexitok/bpe_script_Slav_16000`
+
+ ## Files
+
+ - `super_vocab.json` — merged vocabulary mapping token string → super index
+ - `config.yaml` — model config with `vocab_size`
+ - `participating_tokenizers.json` — list of tokenizer names included
+ - `<tokenizer>_super_mapping.json` — per-tokenizer index → super index mapping
+ - `<tokenizer>_vocab.json` — per-tokenizer vocabulary
+ - `<tokenizer>_info.json` / `.yaml` — tokenizer metadata
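
The file list in the added README describes two kinds of lookup: token string to super index (`super_vocab.json`) and per-tokenizer index to super index (`<tokenizer>_super_mapping.json`). The sketch below is not part of the commit; it shows one way these files might be combined to translate ids from a single tokenizer into the shared index space. It assumes both files are flat JSON objects and that the mapping file uses stringified integers as keys (JSON object keys are always strings). The helper names `load_json`, `to_super_ids`, and `invert_vocab` are hypothetical, and the toy dictionaries stand in for the real files, whose exact layout the README does not spell out.

```python
import json
from pathlib import Path


def load_json(path):
    """Read a JSON file such as super_vocab.json into a dict."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def to_super_ids(local_ids, super_mapping):
    """Translate per-tokenizer ids into super-vocabulary ids.

    Assumption: super_mapping has the shape {"<local index>": <super index>}.
    """
    # JSON object keys are strings, so look up str(local_id).
    return [super_mapping[str(i)] for i in local_ids]


def invert_vocab(super_vocab):
    """Build super index -> token string from token string -> super index."""
    return {index: token for token, index in super_vocab.items()}


# Toy stand-ins for the real files (contents are made up for illustration).
toy_super_vocab = {"a": 0, "b": 1, "c": 2}   # token string -> super index
toy_super_mapping = {"0": 2, "1": 0}         # local index -> super index

super_ids = to_super_ids([1, 0, 1], toy_super_mapping)
tokens = [invert_vocab(toy_super_vocab)[i] for i in super_ids]
print(super_ids, tokens)
```

With the real files, the toy dictionaries would be replaced by `load_json("super_vocab.json")` and `load_json(...)` on the mapping file for the tokenizer of interest. How a tokenizer name such as `flexitok/bpe_script_Arab_16000` is turned into the `<tokenizer>` part of the file name is not stated in the README, so the path is left to the caller.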