nielsr HF Staff committed on
Commit e12fa90 · verified · 1 Parent(s): ba92c72

Add model card for BVV241-max


This PR adds a comprehensive model card for the `Bochkov/bvv241-max` model.

It includes:
- A link to the paper "[Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate](https://huggingface.co/papers/2507.07129)".
- Relevant metadata: `license: apache-2.0`, `library_name: transformers`, and `pipeline_tag: text-generation`.
- A link to the official GitHub repository for the code.
- An overview of the model's purpose and specific details about the `bvv241-max` variant.
- Quick start code examples for loading the tokenizer and frozen embeddings.
- Citation information.

This update significantly improves the discoverability and usability of the model on the Hugging Face Hub.

Files changed (1)
  1. README.md +33 -63
README.md CHANGED
@@ -1,63 +1,52 @@
  ---
  license: apache-2.0
- tags:
- - bvv
- - frozen-embeddings
- - language-model
- - Chinese
- - English
- - conceptual-demo
- - toy-model
- - academic
- model-index:
- - name: Bochkov/best_bvv_zh
-   results:
-   - task:
-       type: text-generation
-     metrics:
-     - name: MMLU (average)
-       type: mmlu
-       value: 19.42
-     - name: BLEU zh-en
-       type: bleu
-       value: 7.78
  ---

- # best_bvv_zh

- **best_bvv_zh** is a conceptual bilingual (English + Chinese) transformer language model trained from scratch on a limited-size 9B-token corpus, as a **demonstration of the frozen-embedding hypothesis** for robust, language-agnostic and easily-combinable language models.

- - **Embedding matrix is _frozen_ after visual-based (Unicode-morpheme) initialization.**
- - All transformer layers and output head are _trainable_.

- ## Key features

- - **Trained on small English+Chinese dataset.**
- - Vocabulary: 131072 (Unicode/visual + frequent n-grams).
- - 16-layer transformer, 1024 hidden dim, 32 heads.

- - **Demonstrates that frozen, compositional, language-agnostic embeddings allow for stable representation learning and can be directly combined into Mixture-of-Experts (MoE) models.**
- - Direct comparison to "unfrozen" version [Bochkov/best_bvv_unfrozen_zh](https://huggingface.co/Bochkov/best_bvv_unfrozen_zh).

- ## Intended use

- - **Academic and engineering demonstration.**
- - Proof-of-concept for multilingual/fusion/frozen-embedding MoE research.
- - NOT intended or suitable for actual production generation or factual knowledge (corpus ~9B tokens only).

- ## Model comparison (vs unfrozen baseline)
-
- | Model | Total Params | MMLU avg (%) | BLEU en-zh (%) | BLEU zh-en (%) |
- |------------------------------------------|:------------:|:------------:|:--------------:|:--------------:|
- | Bochkov/best_bvv_zh (frozen) | 0.5B | 19.4 | 1.41 | 7.78 |
- | Bochkov/best_bvv_unfrozen_zh (baseline) | 0.5B | 14.0 | 1.65 | 5.93 |

- ## 🧑‍🔬 Citation & Concept

- If you use or build upon this demo, please cite:

  ```
  @misc{bochkov2025emergentsemanticstokenembeddings,
  title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
  author={A. Bochkov},
@@ -77,23 +66,4 @@ If you use or build upon this demo, please cite:
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129},
  }
- ```
-
- This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs — a step toward modular, fusable, multilingual LMs.
-
- ## Example Usage
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- import torch
- model = AutoModelForCausalLM.from_pretrained('Bochkov/best_bvv_ru', trust_remote_code=True).to('cuda')
- tokenizer = AutoTokenizer.from_pretrained('Bochkov/best_bvv_ru')
- inputs = tokenizer("Hello, мир! ", return_tensors="pt").to('cuda')
- outputs = model.generate(
-     **inputs,
-     max_new_tokens=100,
-     temperature=0.8,
-     top_k=50,
-     top_p=0.95,
-     do_sample=True
- )
- print(tokenizer.decode(outputs[0]))
 
  ---
  license: apache-2.0
+ library_name: transformers
+ pipeline_tag: text-generation
  ---

+ # BVV241-max: Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
+ 
+ This repository hosts the `bvv241-max` tokenizer and frozen embedding resources, which are a key part of the research presented in the paper [Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate](https://huggingface.co/papers/2507.07129).
+ 
+ This work explores an alternative, constructive approach to large language model (LLM) development, built upon non-trainable, deterministic input embeddings. It demonstrates how such a fixed representational substrate can act as a universal "docking port" for seamless modular composition and progressive layer-wise growth of Transformers, enabling resource-efficient scaling and continual learning.
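+ 
+ To make this concrete, here is a minimal, hypothetical sketch (not the paper's actual code) of "growing" a model on a frozen substrate: the deterministic embedding matrix is wrapped as a non-trainable layer, and new Transformer blocks are appended and trained while everything beneath them stays fixed. The class and method names are illustrative only; `frozen_embeddings` is assumed to be the `[vocab_size, 1024]` tensor provided by this repository (see the Quick Start below).
+ 
+ ```python
+ import torch
+ import torch.nn as nn
+ 
+ class GrowingTransformerLM(nn.Module):
+     """Toy LM: a frozen embedding substrate plus a stack of blocks grown layer by layer."""
+     def __init__(self, frozen_embeddings: torch.Tensor, n_heads: int = 8):
+         super().__init__()
+         vocab_size, dim = frozen_embeddings.shape   # e.g. 131072 x 1024
+         # The embedding table comes from the precomputed matrix and is never updated.
+         self.embed = nn.Embedding.from_pretrained(frozen_embeddings, freeze=True)
+         self.blocks = nn.ModuleList()
+         self.lm_head = nn.Linear(dim, vocab_size, bias=False)
+         self.dim, self.n_heads = dim, n_heads
+ 
+     def grow(self) -> None:
+         """Freeze the blocks trained so far and append one new trainable block."""
+         for p in self.blocks.parameters():
+             p.requires_grad_(False)
+         self.blocks.append(
+             nn.TransformerEncoderLayer(d_model=self.dim, nhead=self.n_heads, batch_first=True)
+         )
+ 
+     def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
+         causal_mask = nn.Transformer.generate_square_subsequent_mask(input_ids.size(1)).to(input_ids.device)
+         h = self.embed(input_ids)                    # fixed, non-semantic substrate
+         for block in self.blocks:
+             h = block(h, src_mask=causal_mask)       # only the newest block receives gradient updates
+         return self.lm_head(h)
+ 
+ # model = GrowingTransformerLM(embeddings); model.grow()  # train, then grow() again, and so on
+ ```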
 
+ For the full code, benchmarks, and research notebooks, please refer to the [GitHub repository](https://github.com/Bochkov/bvv241).
+ 
+ ## BVV241-max Tokenizer and Embedding Details
+ 
+ The `bvv241-max` variant offers:
+ - **Unicode monograms** + bigrams/trigrams + *intersection of token strings* across SOTA models (e.g., o200k_base, cl100k_base, Mistral-Nemo, DeepSeek-R1).
+ - **Vocabulary**: 131,072 tokens.
+ - **Embedding**: 1024-dim, L2-normalized, frozen (**no semantic information**).
+ 
+ These frozen embeddings are suitable for research into semantic emergence when training transformers with fixed, non-semantic ("surface-level") embeddings, and for plug-and-play modular/MoE experiments.
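+ 
+ As an illustration of the plug-and-play idea (a sketch only, with hypothetical class names rather than the paper's implementation), two expert stacks trained on the same frozen substrate can be combined behind a small learned gate, because they share the same input representation. `frozen_embeddings` is again assumed to be the tensor loaded in the Quick Start below.
+ 
+ ```python
+ import torch
+ import torch.nn as nn
+ 
+ def make_expert(dim: int, n_layers: int = 2, n_heads: int = 8) -> nn.Module:
+     # Stand-in for an independently trained Transformer stack (e.g. trained on a different corpus).
+     layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
+     return nn.TransformerEncoder(layer, num_layers=n_layers)
+ 
+ class TwoExpertMixture(nn.Module):
+     """Two expert stacks docked onto one shared, frozen embedding substrate."""
+     def __init__(self, frozen_embeddings: torch.Tensor):
+         super().__init__()
+         dim = frozen_embeddings.shape[1]
+         self.embed = nn.Embedding.from_pretrained(frozen_embeddings, freeze=True)
+         self.expert_a = make_expert(dim)
+         self.expert_b = make_expert(dim)
+         self.gate = nn.Linear(dim, 2)              # learned token-wise mixing weights
+ 
+     def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
+         h = self.embed(input_ids)                  # shared, non-trainable input representation
+         w = torch.softmax(self.gate(h), dim=-1)    # [batch, seq, 2]
+         return w[..., 0:1] * self.expert_a(h) + w[..., 1:2] * self.expert_b(h)
+ ```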
+ 
+ ## Quick Start
+ 
+ You can load the tokenizer and the frozen embedding matrix using the `transformers` library and `huggingface_hub`:
+ 
+ ```python
+ from transformers import AutoTokenizer
+ from huggingface_hub import hf_hub_download
+ import torch
+ 
+ # Load the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained('Bochkov/bvv241-max')
+ 
+ # Download and load the precomputed frozen embeddings
+ emb_path = hf_hub_download(repo_id="Bochkov/bvv241-max", filename="normalized_embeddings_weights.pt")
+ embeddings = torch.load(emb_path)  # shape: [vocab_size, emb_dim]
+ 
+ print(f"Tokenizer vocabulary size: {len(tokenizer)}")
+ print(f"Embeddings shape: {embeddings.shape}")
  ```
+ *Note: This repository primarily provides the tokenizer and frozen embeddings. For a full model capable of text generation, you would typically load a separate model checkpoint that has been trained using these embeddings, ensuring `trust_remote_code=True` if a custom architecture is involved.*
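+ 
+ As a quick sanity check (illustrative only, continuing from the snippet above), you can tokenize a string and look up its frozen vectors directly; if the matrix is L2-normalized as described, each row should have approximately unit norm:
+ 
+ ```python
+ # Encode a sample string and fetch the corresponding rows of the frozen matrix
+ ids = tokenizer("Hello, world!", return_tensors="pt")["input_ids"][0]
+ vectors = embeddings[ids]   # shape: [seq_len, 1024]
+ 
+ # Expect norms close to 1.0 for an L2-normalized embedding table
+ print(vectors.shape, vectors.norm(dim=-1))
+ ```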
+ 
+ ## Citation
+ 
+ If you use or build upon this work, please cite the associated papers:
+ 
+ ```bibtex
  @misc{bochkov2025emergentsemanticstokenembeddings,
  title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
  author={A. Bochkov},

  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129},
  }
+ ```