Bochkov committed · verified
Commit 2b3556f · 1 Parent(s): 6e67568

Update README.md

Files changed (1)
  1. README.md +38 -13

README.md CHANGED
@@ -1,19 +1,44 @@
- 🔤 bvv241-max: Cross-Model Unicode Tokenizer with Shared Token Space (vocab_size=131072, n_embed=1024)
- 🧠 Overview
- Constructed by matching common tokens across multiple SoTA tokenizer vocabularies:
- o200k_base, cl100k, Mistral Nemo, Qwen3, QwQ, DeepSeek-R1
- We found ≈19,000 common text tokens, reindexed as a universal Unicode-aligned vocabulary in a vocab_size=131072 space.
- Combined with Unicode monograms.
- 🧊 Embedding Matrix
- Precomputed frozen embeddings: 131072 x 1024
- Delivered in tensor normalized_embeddings_weights.pt
- 💡 Use Cases
- Multilingual base modeling
- Joint-instruction alignment models
- Shared embedding space for MoE architectures
+ ---
+ # For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
+ # Doc / guide: https://huggingface.co/docs/hub/model-cards
+ {}
+ ---
+
+ # bvv241-max: Unified Unicode Tokenizer (SOTA Intersection) with Frozen Embeddings
+
+ ## Tokenizer Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This tokenizer is based on a hybrid vocabulary:
+ - Most common Unicode codepoints (monograms),
+ - Tokens shared across the vocabularies of leading SOTA models: o200k_base, cl100k_base, Mistral-Nemo, QwQ-32B, DeepSeek-R1, and Qwen3-32B,
+ - Vocabulary size: 131,072 tokens,
+ - Embedding dimension: 1024.
+
+ The associated `normalized_embeddings_weights.pt` file contains a [vocab_size x embed_dim] matrix of precomputed, L2-normalized, frozen embeddings.
+ No semantic information is encoded in these embeddings; they remain fixed throughout LM pretraining.
+ Because no training or adaptation is required, the matrix is suitable for plug-and-play use in research on embedding-free semantic emergence and modular LMs.
+
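+ As a minimal sketch, assuming `normalized_embeddings_weights.pt` has already been downloaded (see the "How to Get Started" section below), the frozen matrix can be wrapped in a non-trainable embedding layer:
+
+ ```python
+ import torch
+
+ # Sketch only: the local path assumes the file was downloaded as shown below.
+ weights = torch.load("normalized_embeddings_weights.pt")  # shape [131072, 1024]
+
+ # freeze=True keeps the weights fixed throughout LM pretraining,
+ # matching the plug-and-play, no-adaptation design described above.
+ embedding_layer = torch.nn.Embedding.from_pretrained(weights, freeze=True)
+
+ token_ids = torch.tensor([[1, 2, 3]])      # illustrative token ids
+ vectors = embedding_layer(token_ids)       # shape [1, 3, 1024]
+ ```
+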
+
+ ## How to Get Started with the Tokenizer
+
+ Use the code below:
+
+ ```python
+ from transformers import AutoTokenizer
+ from huggingface_hub import hf_hub_download
+ import torch
+
+ # Load the tokenizer from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained('Bochkov/bvv241-max')
+
+ # Download the precomputed, frozen embedding matrix
+ emb_path = hf_hub_download(
+     repo_id="Bochkov/bvv241-max",
+     filename="normalized_embeddings_weights.pt"
+ )
+ embeddings = torch.load(emb_path)  # [131072, 1024], L2-normalized
+ ```
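+
+ As a short usage sketch (the sample string is only illustrative), the tokenizer output can be used to index directly into the loaded matrix:
+
+ ```python
+ # Encode a sample string and fetch the frozen vectors for its token ids.
+ input_ids = tokenizer("Hello, world!", return_tensors="pt")["input_ids"]
+ token_vectors = embeddings[input_ids[0]]   # shape [seq_len, 1024]
+
+ # Each row is L2-normalized, so its norm should be close to 1.0.
+ print(token_vectors.shape, token_vectors.norm(dim=-1))
+ ```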