Add model card for BVV241-max
This PR adds a comprehensive model card for the `Bochkov/bvv241-max` model.
It includes:
- A link to the paper "[Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate](https://huggingface.co/papers/2507.07129)".
- Relevant metadata: `license: apache-2.0`, `library_name: transformers`, and `pipeline_tag: text-generation`.
- A link to the official GitHub repository for the code.
- An overview of the model's purpose and specific details about the `bvv241-max` variant.
- Quick start code examples for loading the tokenizer and frozen embeddings.
- Citation information.
This update significantly improves the discoverability and usability of the model on the Hugging Face Hub.
README.md CHANGED
@@ -1,63 +1,52 @@
 ---
 license: apache-2.0
-tags:
-- frozen-embeddings
-- language-model
-- Chinese
-- English
-- conceptual-demo
-- toy-model
-- academic
-model-index:
-- name: Bochkov/best_bvv_zh
-  results:
-  - task:
-      type: text-generation
-    metrics:
-    - name: MMLU (average)
-      type: mmlu
-      value: 19.42
-    - name: BLEU zh-en
-      type: bleu
-      value: 7.78
+library_name: transformers
+pipeline_tag: text-generation
 ---
 
-- All transformer layers and output head are _trainable_.
-- Vocabulary: 131072 (Unicode/visual + frequent n-grams).
-- 16-layer transformer, 1024 hidden dim, 32 heads.
-- Proof-of-concept for multilingual/fusion/frozen-embedding MoE research.
-- NOT intended or suitable for actual production generation or factual knowledge (corpus ~9B tokens only).
-
-| Model                                    | Total Params | MMLU avg (%) | BLEU en-zh (%) | BLEU zh-en (%) |
-|------------------------------------------|:------------:|:------------:|:--------------:|:--------------:|
-| Bochkov/best_bvv_zh (frozen)             | 0.5B         | 19.4         | 1.41           | 7.78           |
-| Bochkov/best_bvv_unfrozen_zh (baseline)  | 0.5B         | 14.0         | 1.65           | 5.93           |
+# BVV241-max: Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
+
+This repository hosts the `bvv241-max` tokenizer and frozen embedding resources, which are a key part of the research presented in the paper [Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate](https://huggingface.co/papers/2507.07129).
+
+This work explores an alternative, constructive approach to large language model (LLM) development, built upon non-trainable, deterministic input embeddings. It demonstrates how such a fixed representational substrate can act as a universal "docking port" for seamless modular composition and progressive layer-wise growth of Transformers, enabling resource-efficient scaling and continual learning.
+
+For the full code, benchmarks, and research notebooks, please refer to the [GitHub repository](https://github.com/Bochkov/bvv241).
+
+## BVV241-max Tokenizer and Embedding Details
+
+The `bvv241-max` variant offers:
+
+- **Unicode monograms** + bigrams/trigrams + *intersection of token strings* across SOTA models (e.g., o200k_base, cl100k_base, Mistral-Nemo, DeepSeek-R1).
+- **Vocabulary**: 131,072 tokens.
+- **Embedding**: 1024-dim, L2-normalized, frozen (**no semantic information**).
+
+These frozen embeddings are suitable for research into semantic emergence when training transformers with fixed, non-semantic ("surface-level") embeddings, and for plug-and-play modular/MoE experiments.
+
+## Quick Start
+
+You can load the tokenizer and the frozen embedding matrix using the `transformers` library and `huggingface_hub`:
+
+```python
+from transformers import AutoTokenizer
+from huggingface_hub import hf_hub_download
+import torch
+
+# Load the tokenizer
+tokenizer = AutoTokenizer.from_pretrained('Bochkov/bvv241-max')
+
+# Download and load the precomputed frozen embeddings
+emb_path = hf_hub_download(repo_id="Bochkov/bvv241-max", filename="normalized_embeddings_weights.pt")
+embeddings = torch.load(emb_path)  # shape: [vocab_size, emb_dim]
+
+print(f"Tokenizer vocabulary size: {len(tokenizer)}")
+print(f"Embeddings shape: {embeddings.shape}")
+```
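+Following the snippet above, a minimal sanity check might look like this (a sketch, assuming the matrix is stored row-per-token as `[vocab_size, emb_dim]`, as the comment above indicates):
+
+```python
+# Inspect how the n-gram vocabulary segments a mixed-language string
+print(tokenizer.tokenize("Hello, мир!"))
+
+# The details above describe 131,072 tokens and 1024-dim, L2-normalized rows,
+# so every row norm should be ~1.0
+print(embeddings.shape)
+print(embeddings.norm(dim=-1)[:5])
+```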
+*Note: This repository primarily provides the tokenizer and frozen embeddings. For a full model capable of text generation, you would typically load a separate model checkpoint that has been trained using these embeddings, ensuring `trust_remote_code=True` if a custom architecture is involved.*
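+To make the "docking port" idea concrete, here is a hypothetical sketch (not the authors' training code) of mounting the frozen matrix as a non-trainable input layer, so that only the transformer blocks above it would be trained:
+
+```python
+import torch.nn as nn
+
+# freeze=True keeps requires_grad=False on the weights, so the substrate stays fixed
+embed = nn.Embedding.from_pretrained(embeddings, freeze=True)
+
+input_ids = tokenizer("Hello, world!", return_tensors="pt")["input_ids"]
+token_vectors = embed(input_ids)  # [1, seq_len, 1024], identical for any model sharing this substrate
+print(token_vectors.shape)
+```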
+
+## Citation
+
+If you use or build upon this work, please cite the associated papers:
+
-```
+```bibtex
 @misc{bochkov2025emergentsemanticstokenembeddings,
   title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
   author={A. Bochkov},
@@ -77,23 +66,4 @@ If you use or build upon this demo, please cite:
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2507.07129},
 }
-```
-
-This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs — a step toward modular, fusable, multilingual LMs.
-
-## Example Usage
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-
-model = AutoModelForCausalLM.from_pretrained('Bochkov/best_bvv_ru', trust_remote_code=True).to('cuda')
-tokenizer = AutoTokenizer.from_pretrained('Bochkov/best_bvv_ru')
-inputs = tokenizer("Hello, мир! ", return_tensors="pt").to('cuda')
-outputs = model.generate(
-    **inputs,
-    max_new_tokens=100,
-    temperature=0.8,
-    top_k=50,
-    top_p=0.95,
-    do_sample=True
-)
-print(tokenizer.decode(outputs[0]))
+```