Feature Extraction
Transformers
gpt2
Bochkov committed · Commit 6e67568 · verified · 1 Parent(s): af565c7

Create README.md

🔤 bvv241-max: Cross-Model Unicode Tokenizer with Shared Token Space (vocab_size=131072, n_embed=1024)

🧠 Overview

Constructed by matching common tokens across multiple SoTA tokenizer vocabularies:

o200k_base, cl100k, Mistral Nemo, Qwen3, QwQ, DeepSeek-R1

We found ≈19,000 common text tokens and reindexed them as a universal Unicode-aligned vocabulary in a vocab_size=131072 space, combined with Unicode monograms.
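The construction above can be sketched in a few lines: intersect the vocabularies by token string, then assign stable IDs in the shared space. This is a minimal illustration, not the actual build script; the toy vocabularies and the assumed layout (BMP monograms mapped to their codepoint, multi-character common tokens placed after them) are placeholders for the real intersection of o200k_base, cl100k, Mistral Nemo, Qwen3, QwQ and DeepSeek-R1.

```python
VOCAB_SIZE = 131072  # target shared token space, as in bvv241-max

def shared_vocabulary(vocabs):
    """Return the tokens present in every tokenizer vocabulary (by string)."""
    common = set(vocabs[0])
    for v in vocabs[1:]:
        common &= set(v)
    return common

def build_index(common_tokens, vocab_size=VOCAB_SIZE):
    """Reindex into one shared space.

    Assumed layout for illustration: Unicode BMP monograms keep their
    codepoint as their ID; multi-character common tokens are appended
    deterministically after the monogram block.
    """
    table = {chr(cp): cp for cp in range(0x10000)}  # monogram block
    next_id = 0x10000
    for tok in sorted(common_tokens):
        if len(tok) == 1 and ord(tok) < 0x10000:
            continue  # already covered by the monogram block
        if next_id >= vocab_size:
            raise ValueError("shared token space exhausted")
        table[tok] = next_id
        next_id += 1
    return table

# Toy stand-ins for real tokenizer vocabularies:
vocabs = [{"the", "ing", "a"}, {"the", "ing", "b"}, {"the", "ing", "a"}]
common = shared_vocabulary(vocabs)   # {"the", "ing"}
index = build_index(common)
```

Because IDs are assigned from sorted token strings, the same set of vocabularies always produces the same shared index, which is what lets different models agree on the token space.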

🧊 Embedding Matrix

Precomputed frozen embeddings: 131072 × 1024
Delivered as the tensor normalized_embeddings_weights.pt
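A minimal sketch of consuming the frozen matrix, assuming PyTorch: load the tensor and wrap it in a non-trainable `nn.Embedding`. A small random matrix stands in here for the real 131072 × 1024 tensor from `normalized_embeddings_weights.pt`.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, N_EMBED = 131072, 1024  # shapes of the shipped tensor

# Stand-in for: weights = torch.load("normalized_embeddings_weights.pt")
weights = torch.randn(256, 16)

# freeze=True keeps the precomputed embeddings fixed during training
emb = nn.Embedding.from_pretrained(weights, freeze=True)

token_ids = torch.tensor([3, 7, 42])
vectors = emb(token_ids)  # shape (3, 16); rows copied from `weights`
assert not emb.weight.requires_grad
```

Since the matrix is frozen, it can be shared verbatim across models trained on the same token space.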

💡 Use Cases

- Multilingual base modeling
- Joint-instruction alignment models
- Shared embedding space for MoE architectures