Bochkov
/

bvv241-max

Feature Extraction

Model card Files Files and versions

Bochkov commited on Jun 15, 2025

Commit

9021e1f

·

verified ·

1 Parent(s): 2b3556f

Update README.md

Files changed (1) hide show

README.md +7 -1

README.md CHANGED Viewed

@@ -11,7 +11,13 @@
 <!-- Provide a longer summary of what this model is. -->
 This tokenizer is based on a hybrid vocabulary:
-- Most common Unicode codepoints (monograms),
 - Tokenizer created from the intersection of token text across leading SOTA models
 - Includes o200k_base, cl100k_base, Mistral-Nemo, QwQ-32B, DeepSeek-R1, Qwen3-32B vocabularies,
 - Vocabulary size: 131,072 tokens,

 <!-- Provide a longer summary of what this model is. -->
 This tokenizer is based on a hybrid vocabulary:
+This tokenizer uses a strictly structured Unicode mapping scheme:
+- Plane 0 (0–65535): All single Unicode code points (monograms) are mapped 1:1 to token codes, directly matching standard Unicode BMP.
+- Private and unused code ranges (Plane 0 high + supplementary, e.g., 0xE000–0xF8FF and 65536–131071):
+    - All multi-character tokens (bigrams, trigrams, SOTA model token strings) are placed exclusively in these ranges.
+- This design achieves total, lossless Unicode text coverage, with all multi-symbol tokens isolated above the core Unicode range.
 - Tokenizer created from the intersection of token text across leading SOTA models
 - Includes o200k_base, cl100k_base, Mistral-Nemo, QwQ-32B, DeepSeek-R1, Qwen3-32B vocabularies,
 - Vocabulary size: 131,072 tokens,