Bochkov committed · Commit 9021e1f · verified · 1 parent: 2b3556f

Update README.md

Files changed (1)
README.md +7 -1
README.md CHANGED
@@ -11,7 +11,13 @@
 <!-- Provide a longer summary of what this model is. -->
 
 This tokenizer is based on a hybrid vocabulary:
-- Most common Unicode codepoints (monograms),
+
+This tokenizer uses a strictly structured Unicode mapping scheme:
+
+- Plane 0 (0–65535): All single Unicode code points (monograms) are mapped 1:1 to token codes, directly matching the standard Unicode BMP.
+- Private and unused code ranges (Plane 0 high + supplementary, e.g., 0xE000–0xF8FF and 65536–131071):
+  - All multi-character tokens (bigrams, trigrams, SOTA model token strings) are placed exclusively in these ranges.
+- This design achieves total, lossless Unicode text coverage, with all multi-symbol tokens isolated above the core Unicode range.
 - Tokenizer created from the intersection of token text across leading SOTA models
 - Includes o200k_base, cl100k_base, Mistral-Nemo, QwQ-32B, DeepSeek-R1, Qwen3-32B vocabularies,
 - Vocabulary size: 131,072 tokens,
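
The mapping scheme the new README text describes is concrete enough to sketch in code. Below is a minimal, illustrative Python sketch: monograms encode as their own BMP code points, while multi-character tokens occupy the reserved ranges. The merge table (`MULTI_CHAR_TOKENS`), the greedy longest-match loop, and the `encode`/`decode` helpers are assumptions for demonstration only; the tokenizer's actual vocabulary files define the real multi-character tokens, and handling of characters outside the BMP is not specified here.

```python
# Minimal sketch of the mapping scheme described in the diff above.
# MULTI_CHAR_TOKENS, encode, and decode are hypothetical illustrations;
# the real multi-character vocabulary ships with the tokenizer files.

# Hypothetical multi-character tokens, placed in the ranges the README
# reserves for them (0xE000-0xF8FF and 65536-131071).
MULTI_CHAR_TOKENS = {
    "the": 65536,  # hypothetical trigram ID in the supplementary range
    "in": 0xE000,  # hypothetical bigram ID in the private-use range
}
INVERSE = {i: s for s, i in MULTI_CHAR_TOKENS.items()}


def encode(text: str) -> list[int]:
    """Greedy longest-match: prefer multi-character tokens, otherwise fall
    back to the 1:1 BMP monogram rule (token ID == code point)."""
    ids, i = [], 0
    max_len = max(map(len, MULTI_CHAR_TOKENS), default=1)
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 1, -1):
            piece = text[i:i + length]
            if piece in MULTI_CHAR_TOKENS:
                ids.append(MULTI_CHAR_TOKENS[piece])
                i += length
                break
        else:
            ids.append(ord(text[i]))  # assumes a BMP character (< 0x10000)
            i += 1
    return ids


def decode(ids: list[int]) -> str:
    """Lossless inverse: the table is consulted first, so IDs parked in the
    private-use range are never misread as literal PUA characters."""
    return "".join(INVERSE.get(t, chr(t)) for t in ids)


assert decode(encode("the cat in the hat")) == "the cat in the hat"
```

Because every multi-character ID sits in a range the scheme reserves away from ordinary text mappings, `decode` can always distinguish a merged token from a literal character, which is what makes the round trip lossless for BMP text under this sketch's assumptions.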