Commit 84ca195 by gsaltintas (verified)
1 parent: a0fe286

Upload README.md with huggingface_hub

  1. README.md +26 -0
README.md ADDED
@@ -0,0 +1,26 @@
+ # Super Vocabulary
+
+ A merged super-vocabulary built from 9 tokenizers.
+
+ **Vocab size:** 100007
+
+ ## Tokenizers
+
+ - `flexitok/mod-tokenizers-individual`
+ - `flexitok/mod-tokenizers-ltr_3digit`
+ - `flexitok/mod-tokenizers-ltr_2digit`
+ - `flexitok/mod-tokenizers-ltr_4digit`
+ - `flexitok/mod-tokenizers-ltr_5digit`
+ - `flexitok/mod-tokenizers-rtl_2digit`
+ - `flexitok/mod-tokenizers-rtl_3digit`
+ - `flexitok/mod-tokenizers-rtl_4digit`
+ - `flexitok/mod-tokenizers-rtl_5digit`
+
+ ## Files
+
+ - `super_vocab.json` — merged vocabulary mapping token string → super index
+ - `config.yaml` — model config with `vocab_size`
+ - `participating_tokenizers.json` — list of tokenizer names included
+ - `<tokenizer>_super_mapping.json` — per-tokenizer index → super index mapping
+ - `<tokenizer>_vocab.json` — per-tokenizer vocabulary
+ - `<tokenizer>_info.json` / `.yaml` — tokenizer metadata
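
The file list in the README describes two mappings: token string → super index (`super_vocab.json`) and per-tokenizer index → super index (`<tokenizer>_super_mapping.json`). A minimal sketch of how they might be combined is shown here. It is not taken from the repository: the JSON shapes (a flat object for each file, with per-tokenizer indices stored as string keys), the helper names `load_json`, `to_super_ids`, and `invert_vocab`, the `toy` file prefix, and all toy contents are assumptions made for illustration. How `<tokenizer>` is expanded into a real file name is not stated in the README, so the sketch does not guess it.

```python
import json
import tempfile
from pathlib import Path


def load_json(path):
    """Read one JSON file and return the parsed object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def to_super_ids(local_ids, super_mapping):
    """Translate per-tokenizer ids to super-vocabulary ids.

    Assumes super_mapping is a dict keyed by the per-tokenizer index.
    JSON object keys are always strings, so ids are looked up as str.
    """
    return [super_mapping[str(i)] for i in local_ids]


def invert_vocab(super_vocab):
    """Build super index -> token string from token string -> super index."""
    return {index: token for token, index in super_vocab.items()}


# Toy stand-ins for the real files, written to a temporary directory.
# The contents and the "toy" file prefix are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "super_vocab.json").write_text(
        json.dumps({"0": 0, "1": 1, "12": 2, "123": 3}), encoding="utf-8"
    )
    (root / "toy_super_mapping.json").write_text(
        json.dumps({"0": 1, "1": 3}), encoding="utf-8"
    )

    super_vocab = load_json(root / "super_vocab.json")
    toy_mapping = load_json(root / "toy_super_mapping.json")

# Ids produced by the toy tokenizer, re-expressed in the shared index space.
super_ids = to_super_ids([1, 0, 1], toy_mapping)
tokens = [invert_vocab(super_vocab)[i] for i in super_ids]
print(super_ids)
print(tokens)
```

If the real files follow the assumed shapes, the same helpers could be pointed at a local copy of the repository instead of the temporary directory. A lookup that raises `KeyError` would suggest the mapping file uses a different layout than assumed here.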