---
license: apache-2.0
tags:
- text-generation
- causal-lm
- transformer
- research
- interpretability
- multilingual
- unicode
- frozen-embeddings
- ablation
language:
- multilingual
library_name: transformers
pipeline_tag: text-generation
---

# Emergent Semantics — Model_1024_FLOAT (335M)

This repository provides **Model_1024_FLOAT (335M)**, an **ablation model** from the papers:

[📚 Paper: Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations](https://huggingface.co/papers/2507.04886)

[📚 Paper: Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate](https://huggingface.co/papers/2507.07129)

This checkpoint is designed to isolate the effect of **float-valued, normalized frozen embeddings** versus **binary frozen embeddings**, while keeping the Transformer backbone and training setup identical.

---

## What this ablation is

**Model_1024_FLOAT** uses a frozen embedding table where:

- **`n_embed = 1024`** (embedding dimensionality equals `d_model`)
- Each token embedding is a **float vector**
- The embedding vectors are derived from a **random (non-semantic) codebook** and then **normalized** (e.g., L2 normalization) to control scale
- The embedding weights are **frozen** (`requires_grad=False`) for the entire training run (see the sketch below)

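To make the list above concrete, here is a minimal PyTorch sketch of how such a table could be built. This is an illustration under stated assumptions (Gaussian initialization, row-wise L2 normalization), not the paper's actual construction:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 65536  # matches this model's vocabulary
N_EMBED = 1024      # n_embed == d_model for this ablation

# Random (non-semantic) float codebook, L2-normalized row-wise to control scale.
codebook = torch.randn(VOCAB_SIZE, N_EMBED)
codebook = F.normalize(codebook, p=2, dim=-1)

# freeze=True sets requires_grad=False, so the table never receives gradients.
embedding = nn.Embedding.from_pretrained(codebook, freeze=True)

# For the 1024_BIT sibling model, the codebook would instead be random binary
# vectors (the exact construction there is an assumption), e.g.:
# bit_codebook = torch.randint(0, 2, (VOCAB_SIZE, N_EMBED)).float()
```
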
This model is part of an ablation series that tests whether differences in training dynamics and downstream reasoning come from:

- semantic structure in the embeddings (hypothesis: not required), or
- simply numeric properties such as dtype, scale, and normalization.

---

## Relation to other models in the collection

- Compared to **Model_1024_BIT (335M)**:
  - Same backbone (`d_model=1024`, 16 layers, 32 heads, RoPE, GELU)
  - Same embedding dimensionality (`n_embed=1024`)
  - The difference is the embedding representation:
    - **1024_BIT:** frozen random **binary** vectors
    - **1024_FLOAT:** frozen random **float** vectors with **normalization**

- Compared to **Model_UNI_GLYPH (335M)**:
  - Same embedding dimensionality and frozen setup
  - UNI_GLYPH embeddings come from glyph rendering + PCA; here the embeddings are random and intended to be non-semantic

- Compared to **Model_unfrozen (335M)**:
  - Same architecture
  - Here the embeddings are frozen; in the baseline they are trainable

Because `n_embed=1024`, this model is in the same **parameter-count class (~335M)** as UNI_GLYPH and the unfrozen baseline.

---

## Model summary

- **Architecture:** decoder-only Transformer (GPT-like)
- **Hidden size (`d_model`):** 1024
- **Layers:** 16
- **Heads:** 32
- **Positional encoding:** rotary embeddings (RoPE)
- **Activation:** GELU
- **Vocabulary size:** 65,536
- **Tokenizer:** `Bochkov/bvv241-2-3` compatible
- **Input embeddings:** frozen, random **float**, **normalized**, `n_embed=1024`
- **Output head:** **not tied** to the input embeddings (trained separately)

A back-of-the-envelope parameter count from these hyperparameters is sketched below.

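The sketch assumes a standard 4x MLP expansion and ignores LayerNorm and bias terms (both assumptions; the paper's code may differ), and shows how the model lands in the ~335M class:

```python
# Rough parameter count from the hyperparameters listed above.
# Assumes a 4x MLP expansion; ignores LayerNorm and bias terms.
d_model, n_layers, vocab = 1024, 16, 65536

attn = 4 * d_model**2          # Q, K, V, and output projections
mlp = 2 * (4 * d_model**2)     # up- and down-projections at 4x width
blocks = n_layers * (attn + mlp)

embed_in = vocab * d_model     # frozen input table
embed_out = vocab * d_model    # untied output head

total = blocks + embed_in + embed_out
print(f"{total / 1e6:.1f}M")   # 335.5M -> the ~335M class
```
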
---

## Tokenizer

The intended tokenizer is **bvv241-2-3**:

- https://huggingface.co/Bochkov/bvv241-2-3

You can load the tokenizer either from this model repo (if included) or from the standalone tokenizer repo. The key requirement is **exact vocab alignment**; a quick sanity check is sketched below.

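A minimal sketch of such a check, using standard `transformers` accessors:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Sanity check: the tokenizer's vocab must line up exactly with the
# model's (frozen) input embedding table.
tokenizer = AutoTokenizer.from_pretrained("Bochkov/bvv241-2-3")
model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-1024-float-335m", trust_remote_code=True
)

assert len(tokenizer) == model.get_input_embeddings().num_embeddings, \
    "Tokenizer vocab and embedding table are misaligned"
```
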
---

## How to use (Transformers)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-1024-float-335m")
model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-1024-float-335m",
    trust_remote_code=True,
)

# Run on GPU if available, falling back to CPU; inputs must be on the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

prompt = "Question: What is the capital of Japan?\nAnswer:"
inputs = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long, device=device)

outputs = model.generate(
    inputs,
    max_new_tokens=10,
    do_sample=False,  # greedy decoding
)
print(tokenizer.decode(outputs[0].tolist()))
```

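To sanity-check the defining property of this ablation (a normalized float codebook), you can inspect the loaded embedding table. A minimal sketch, continuing from the snippet above; note that whether the table comes back frozen after loading depends on the repo's custom code:

```python
# Inspect the input embedding table of the loaded model.
emb = model.get_input_embeddings().weight

# Per the model card, rows should be (approximately) unit-norm floats.
print(emb.dtype)                       # a float dtype
print(emb.norm(dim=-1).mean().item())  # ~1.0 if L2-normalized
print(emb.requires_grad)               # False if the custom code re-freezes on load
```
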
---

## Intended use

This is a research-only checkpoint, intended for:

- Studying **emergent semantics** with a frozen random float codebook
- Isolating the impact of **normalization / vector scale** in frozen embeddings
- Comparisons against **1024_BIT** and **UNI_GLYPH** under identical backbone/training conditions

It is not intended for production deployment (no safety or instruction tuning).

---

## Related links

- **Model collection (paper artifacts):**
  https://huggingface.co/collections/Bochkov/emergent-semantics-beyond-token-embeddings
- **UNI_GLYPH model (frozen visual glyph embeddings):**
  https://huggingface.co/Bochkov/emergent-semantics-model-uni-glyph-335m
- **1024_BIT model (binary random frozen embeddings):**
  https://huggingface.co/Bochkov/emergent-semantics-model-1024-bit-335m
- **Tokenizer:**
  https://huggingface.co/Bochkov/bvv241-2-3
- **Code (GitHub):**
  https://github.com/AVBochkov/Embeddings

---

## 🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

```bibtex
@article{bochkov2025emergent,
  title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
  author={Andrey Bochkov},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=Odh8IynO1o}
}

@misc{bochkov2025growingtransformersmodularcomposition,
  title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
  author={A. Bochkov},
  year={2025},
  eprint={2507.07129},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129}
}
```