Bochkov committed · verified
Commit 2b3556f · 1 Parent(s): 6e67568

Update README.md

Files changed (1)
  1. README.md +38 -13

README.md CHANGED
@@ -1,19 +1,44 @@
- 🔤 bvv241-max: Cross-Model Unicode Tokenizer with Shared Token Space (vocab_size=131072, n_embed=1024)
- 🧠 Overview
- Constructed by matching common tokens across multiple SoTA tokenizer vocabularies:
- o200k_base, cl100k, Mistral Nemo, Qwen3, QwQ, DeepSeek-R1
- We found ≈19,000 common text tokens, reindexed as a universal Unicode-aligned vocabulary in a vocab_size=131072 space.
- Combined with Unicode monograms.
- 🧊 Embedding Matrix
- Precomputed frozen embeddings: 131072 x 1024
- Delivered in tensor normalized_embeddings_weights.pt
- 💡 Use Cases
- Multilingual base modeling
- Joint-instruction alignment models
- Shared embedding space for MoE architectures
+ ---
+ # For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
+ # Doc / guide: https://huggingface.co/docs/hub/model-cards
+ {}
+ ---
+
+ # bvv241-max: Unified Unicode Tokenizer (SOTA Intersection) with Frozen Embeddings
+
+ ## Tokenizer Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This tokenizer is based on a hybrid vocabulary:
+ - Most common Unicode codepoints (monograms),
+ - Tokens shared across the vocabularies of leading SOTA models: o200k_base, cl100k_base, Mistral-Nemo, QwQ-32B, DeepSeek-R1, and Qwen3-32B,
+ - Vocabulary size: 131,072 tokens,
+ - Embedding dimension: 1024.
+
+ The associated `normalized_embeddings_weights.pt` file contains a [vocab_size x embed_dim] matrix of precomputed, L2-normalized, frozen embeddings.
+ No semantic information is encoded in these embeddings; they remain fixed throughout LM pretraining.
+ Because no training or adaptation is required, the matrix is suitable for plug-and-play use in research on embedding-free semantic emergence and modular LMs.
+
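+ As a minimal sketch, assuming `normalized_embeddings_weights.pt` has already been downloaded (see the "How to Get Started" section below), the frozen matrix can be wrapped in a non-trainable embedding layer:
+
+ ```python
+ import torch
+
+ # Sketch only: the local path assumes the file was downloaded as shown below.
+ weights = torch.load("normalized_embeddings_weights.pt")  # shape [131072, 1024]
+
+ # freeze=True keeps the weights fixed throughout LM pretraining,
+ # matching the plug-and-play, no-adaptation design described above.
+ embedding_layer = torch.nn.Embedding.from_pretrained(weights, freeze=True)
+
+ token_ids = torch.tensor([[1, 2, 3]])      # illustrative token ids
+ vectors = embedding_layer(token_ids)       # shape [1, 3, 1024]
+ ```
+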
+
+ ## How to Get Started with the Tokenizer
+
+ Use the code below:
+
+ ```python
+ from transformers import AutoTokenizer
+ from huggingface_hub import hf_hub_download
+ import torch
+
+ # Load the tokenizer from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained('Bochkov/bvv241-max')
+
+ # Download the precomputed, frozen embedding matrix
+ emb_path = hf_hub_download(
+     repo_id="Bochkov/bvv241-max",
+     filename="normalized_embeddings_weights.pt"
+ )
+ embeddings = torch.load(emb_path)  # [131072, 1024], L2-normalized
+ ```
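+
+ As a short usage sketch (the sample string is only illustrative), the tokenizer output can be used to index directly into the loaded matrix:
+
+ ```python
+ # Encode a sample string and fetch the frozen vectors for its token ids.
+ input_ids = tokenizer("Hello, world!", return_tensors="pt")["input_ids"]
+ token_vectors = embeddings[input_ids[0]]   # shape [seq_len, 1024]
+
+ # Each row is L2-normalized, so its norm should be close to 1.0.
+ print(token_vectors.shape, token_vectors.norm(dim=-1))
+ ```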