nielsr HF Staff committed on
Commit e12fa90 · verified · 1 Parent(s): ba92c72

Add model card for BVV241-max


This PR adds a comprehensive model card for the `Bochkov/bvv241-max` model.

It includes:
- A link to the paper "[Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate](https://huggingface.co/papers/2507.07129)".
- Relevant metadata: `license: apache-2.0`, `library_name: transformers`, and `pipeline_tag: text-generation`.
- A link to the official GitHub repository for the code.
- An overview of the model's purpose and specific details about the `bvv241-max` variant.
- Quick start code examples for loading the tokenizer and frozen embeddings.
- Citation information.

This update significantly improves the discoverability and usability of the model on the Hugging Face Hub.

Files changed (1)
  1. README.md +33 -63
README.md CHANGED
@@ -1,63 +1,52 @@
  ---
  license: apache-2.0
- tags:
- - bvv
- - frozen-embeddings
- - language-model
- - Chinese
- - English
- - conceptual-demo
- - toy-model
- - academic
- model-index:
- - name: Bochkov/best_bvv_zh
-   results:
-   - task:
-       type: text-generation
-     metrics:
-     - name: MMLU (average)
-       type: mmlu
-       value: 19.42
-     - name: BLEU zh-en
-       type: bleu
-       value: 7.78
  ---

- # best_bvv_zh

- **best_bvv_zh** is a conceptual bilingual (English + Chinese) transformer language model trained from scratch on a limited-size 9B-token corpus, as a **demonstration of the frozen-embedding hypothesis** for robust, language-agnostic and easily-combinable language models.

- - **Embedding matrix is _frozen_ after visual-based (Unicode-morpheme) initialization.**
- - All transformer layers and output head are _trainable_.

- ## Key features

- - **Trained on small English+Chinese dataset.**
- - Vocabulary: 131072 (Unicode/visual + frequent n-grams).
- - 16-layer transformer, 1024 hidden dim, 32 heads.

- - **Demonstrates that frozen, compositional, language-agnostic embeddings allow for stable representation learning and can be directly combined into Mixture-of-Experts (MoE) models.**
- - Direct comparison to "unfrozen" version [Bochkov/best_bvv_unfrozen_zh](https://huggingface.co/Bochkov/best_bvv_unfrozen_zh).

- ## Intended use

- - **Academic and engineering demonstration.**
- - Proof-of-concept for multilingual/fusion/frozen-embedding MoE research.
- - NOT intended or suitable for actual production generation or factual knowledge (corpus ~9B tokens only).

- ## Model comparison (vs unfrozen baseline)
-
- | Model | Total Params | MMLU avg (%) | BLEU en-zh (%) | BLEU zh-en (%) |
- |------------------------------------------|:------------:|:------------:|:--------------:|:--------------:|
- | Bochkov/best_bvv_zh (frozen) | 0.5B | 19.4 | 1.41 | 7.78 |
- | Bochkov/best_bvv_unfrozen_zh (baseline) | 0.5B | 14.0 | 1.65 | 5.93 |

- ## 🧑‍🔬 Citation & Concept

- If you use or build upon this demo, please cite:

  ```
  @misc{bochkov2025emergentsemanticstokenembeddings,
  title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
  author={A. Bochkov},
@@ -77,23 +66,4 @@ If you use or build upon this demo, please cite:
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129},
  }
- ```
-
- This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs — a step toward modular, fusable, multilingual LMs.
-
- ## Example Usage
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- import torch
- model = AutoModelForCausalLM.from_pretrained('Bochkov/best_bvv_ru', trust_remote_code=True).to('cuda')
- tokenizer = AutoTokenizer.from_pretrained('Bochkov/best_bvv_ru')
- inputs = tokenizer("Hello, мир! ", return_tensors="pt").to('cuda')
- outputs = model.generate(
-     **inputs,
-     max_new_tokens=100,
-     temperature=0.8,
-     top_k=50,
-     top_p=0.95,
-     do_sample=True
- )
- print(tokenizer.decode(outputs[0]))
 
  ---
  license: apache-2.0
+ library_name: transformers
+ pipeline_tag: text-generation
  ---

+ # BVV241-max: Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
+ 
+ This repository hosts the `bvv241-max` tokenizer and frozen embedding resources, which are a key part of the research presented in the paper [Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate](https://huggingface.co/papers/2507.07129).
+ 
+ This work explores an alternative, constructive approach to large language model (LLM) development, built upon non-trainable, deterministic input embeddings. It demonstrates how such a fixed representational substrate can act as a universal "docking port" for seamless modular composition and progressive layer-wise growth of Transformers, enabling resource-efficient scaling and continual learning.
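+ 
+ To make this concrete, here is a minimal, hypothetical sketch (not the paper's actual code) of "growing" a model on a frozen substrate: the deterministic embedding matrix is wrapped as a non-trainable layer, and new Transformer blocks are appended and trained while everything beneath them stays fixed. The class and method names are illustrative only; `frozen_embeddings` is assumed to be the `[vocab_size, 1024]` tensor provided by this repository (see the Quick Start below).
+ 
+ ```python
+ import torch
+ import torch.nn as nn
+ 
+ class GrowingTransformerLM(nn.Module):
+     """Toy LM: a frozen embedding substrate plus a stack of blocks grown layer by layer."""
+     def __init__(self, frozen_embeddings: torch.Tensor, n_heads: int = 8):
+         super().__init__()
+         vocab_size, dim = frozen_embeddings.shape   # e.g. 131072 x 1024
+         # The embedding table comes from the precomputed matrix and is never updated.
+         self.embed = nn.Embedding.from_pretrained(frozen_embeddings, freeze=True)
+         self.blocks = nn.ModuleList()
+         self.lm_head = nn.Linear(dim, vocab_size, bias=False)
+         self.dim, self.n_heads = dim, n_heads
+ 
+     def grow(self) -> None:
+         """Freeze the blocks trained so far and append one new trainable block."""
+         for p in self.blocks.parameters():
+             p.requires_grad_(False)
+         self.blocks.append(
+             nn.TransformerEncoderLayer(d_model=self.dim, nhead=self.n_heads, batch_first=True)
+         )
+ 
+     def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
+         causal_mask = nn.Transformer.generate_square_subsequent_mask(input_ids.size(1)).to(input_ids.device)
+         h = self.embed(input_ids)                    # fixed, non-semantic substrate
+         for block in self.blocks:
+             h = block(h, src_mask=causal_mask)       # only the newest block receives gradient updates
+         return self.lm_head(h)
+ 
+ # model = GrowingTransformerLM(embeddings); model.grow()  # train, then grow() again, and so on
+ ```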
 
+ For the full code, benchmarks, and research notebooks, please refer to the [GitHub repository](https://github.com/Bochkov/bvv241).
+ 
+ ## BVV241-max Tokenizer and Embedding Details
+ 
+ The `bvv241-max` variant offers:
+ - **Unicode monograms** + bigrams/trigrams + *intersection of token strings* across SOTA models (e.g., o200k_base, cl100k_base, Mistral-Nemo, DeepSeek-R1).
+ - **Vocabulary**: 131,072 tokens.
+ - **Embedding**: 1024-dim, L2-normalized, frozen (**no semantic information**).
+ 
+ These frozen embeddings are suitable for research into semantic emergence when training transformers with fixed, non-semantic ("surface-level") embeddings, and for plug-and-play modular/MoE experiments.
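+ 
+ As an illustration of the plug-and-play idea (a sketch only, with hypothetical class names rather than the paper's implementation), two expert stacks trained on the same frozen substrate can be combined behind a small learned gate, because they share the same input representation. `frozen_embeddings` is again assumed to be the tensor loaded in the Quick Start below.
+ 
+ ```python
+ import torch
+ import torch.nn as nn
+ 
+ def make_expert(dim: int, n_layers: int = 2, n_heads: int = 8) -> nn.Module:
+     # Stand-in for an independently trained Transformer stack (e.g. trained on a different corpus).
+     layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
+     return nn.TransformerEncoder(layer, num_layers=n_layers)
+ 
+ class TwoExpertMixture(nn.Module):
+     """Two expert stacks docked onto one shared, frozen embedding substrate."""
+     def __init__(self, frozen_embeddings: torch.Tensor):
+         super().__init__()
+         dim = frozen_embeddings.shape[1]
+         self.embed = nn.Embedding.from_pretrained(frozen_embeddings, freeze=True)
+         self.expert_a = make_expert(dim)
+         self.expert_b = make_expert(dim)
+         self.gate = nn.Linear(dim, 2)              # learned token-wise mixing weights
+ 
+     def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
+         h = self.embed(input_ids)                  # shared, non-trainable input representation
+         w = torch.softmax(self.gate(h), dim=-1)    # [batch, seq, 2]
+         return w[..., 0:1] * self.expert_a(h) + w[..., 1:2] * self.expert_b(h)
+ ```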
+ 
+ ## Quick Start
+ 
+ You can load the tokenizer and the frozen embedding matrix using the `transformers` library and `huggingface_hub`:
+ 
+ ```python
+ from transformers import AutoTokenizer
+ from huggingface_hub import hf_hub_download
+ import torch
+ 
+ # Load the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained('Bochkov/bvv241-max')
+ 
+ # Download and load the precomputed frozen embeddings
+ emb_path = hf_hub_download(repo_id="Bochkov/bvv241-max", filename="normalized_embeddings_weights.pt")
+ embeddings = torch.load(emb_path)  # shape: [vocab_size, emb_dim]
+ 
+ print(f"Tokenizer vocabulary size: {len(tokenizer)}")
+ print(f"Embeddings shape: {embeddings.shape}")
  ```
+ *Note: This repository primarily provides the tokenizer and frozen embeddings. For a full model capable of text generation, you would typically load a separate model checkpoint that has been trained using these embeddings, ensuring `trust_remote_code=True` if a custom architecture is involved.*
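+ 
+ As a quick sanity check (illustrative only, continuing from the snippet above), you can tokenize a string and look up its frozen vectors directly; if the matrix is L2-normalized as described, each row should have approximately unit norm:
+ 
+ ```python
+ # Encode a sample string and fetch the corresponding rows of the frozen matrix
+ ids = tokenizer("Hello, world!", return_tensors="pt")["input_ids"][0]
+ vectors = embeddings[ids]   # shape: [seq_len, 1024]
+ 
+ # Expect norms close to 1.0 for an L2-normalized embedding table
+ print(vectors.shape, vectors.norm(dim=-1))
+ ```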
+ 
+ ## Citation
+ 
+ If you use or build upon this work, please cite the associated papers:
+ 
+ ```bibtex
  @misc{bochkov2025emergentsemanticstokenembeddings,
  title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
  author={A. Bochkov},

  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129},
  }
+ ```