---
license: apache-2.0
tags:
- text-generation
- causal-lm
- transformer
- research
- interpretability
- multilingual
- unicode
- frozen-embeddings
- ablation
language:
- multilingual
library_name: transformers
pipeline_tag: text-generation
---

# Emergent Semantics — Model_16_FLOAT (269M)

This repository provides **Model_16_FLOAT (269M)** — an **ablation model** from the paper:

- *Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations*

This checkpoint is designed to study the effect of **normalization / PCA-style processing** in a *minimal* frozen embedding setting.

Unlike **Model_UNI_GLYPH**, this model does **not** use glyph-based embeddings. Instead, it uses a **frozen 16-dimensional float embedding** per token.

---

## Key idea (what this ablation tests)

This model isolates the impact of having **float** frozen embeddings (with **PCA + normalization**) versus the strictly **binary token-ID** variant (**Model_16_BIT**):

- **`n_embed = 16`** per token (**float components**, not binary)
- Embedding vectors are **precomputed** (PCA + L2 normalization) and then **frozen**
- The embedding layer is never updated (`requires_grad=False`)
- To match the Transformer hidden size, the 16-dim embedding is expanded to 1024 via a **non-trainable repetition**:
  `repeat_interleave(64)` → `16 * 64 = 1024` (see the sketch below)
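
To make the expansion concrete, here is a minimal PyTorch sketch of the frozen-embedding path. It is illustrative only: the variable names and the random stand-in vectors are assumptions, not the repository's actual preprocessing code.

```python
import torch
import torch.nn as nn

vocab_size, n_embed, d_model = 65536, 16, 1024

# Precomputed 16-dim float vectors per token (random stand-ins here for the
# PCA-derived components), L2-normalized as described above.
vectors = torch.randn(vocab_size, n_embed)
vectors = vectors / vectors.norm(dim=-1, keepdim=True)

# Frozen embedding table: never updated during training.
embed = nn.Embedding(vocab_size, n_embed)
embed.weight.data.copy_(vectors)
embed.weight.requires_grad = False

token_ids = torch.tensor([[101, 2047, 9]])            # (batch, seq)
x = embed(token_ids)                                   # (batch, seq, 16)
x = x.repeat_interleave(d_model // n_embed, dim=-1)    # (batch, seq, 1024)
print(x.shape)  # torch.Size([1, 3, 1024])
```

Because the expansion is a fixed repetition, no trainable parameters sit between the 16-dim table and the 1024-dim residual stream.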

This lets you test whether the model’s behavior changes when the frozen token “identifier” is:
- discrete + purely ID-like (**16-bit**), vs
- continuous + normalized (**16-float**)

---

## Important: parameter count difference (vs 335M models)

This checkpoint has **~269M parameters**, while models with a standard `n_embed=1024` embedding table (e.g. **UNI_GLYPH / unfrozen baselines**) are **~335M**.

This difference is expected and comes primarily from the embedding matrix size:

- Standard embedding params: `vocab_size * 1024 = 65536 * 1024 ≈ 67.1M`
- This model’s embedding params: `vocab_size * 16 = 65536 * 16 ≈ 1.0M`

So the **Transformer backbone is the same** (layers/heads/d_model), but the embedding table is much smaller, reducing total parameters.
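
A quick back-of-the-envelope check of that gap (a sketch; since the backbone is shared, only the embedding table changes the total):

```python
vocab_size, d_model, n_embed = 65536, 1024, 16

standard_embedding = vocab_size * d_model      # 67,108,864 ≈ 67.1M
frozen_float_embedding = vocab_size * n_embed  # 1,048,576  ≈ 1.0M

# ≈ 66.1M, which accounts for the ~335M vs ~269M difference.
print(f"{(standard_embedding - frozen_float_embedding) / 1e6:.1f}M")
```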

---

## Model summary

- **Architecture:** decoder-only Transformer (GPT-like)
- **Hidden size (`d_model`):** 1024
- **Layers:** 16
- **Heads:** 32
- **Positional encoding:** rotary embeddings
- **Activation:** GELU
- **Tokenizer / vocab size:** 65,536 (bvv241-2-3 compatible)
- **Input embeddings:** **frozen**, `n_embed=16` (**float**, PCA + L2 normalized), expanded to 1024 by repetition (non-trainable)
- **Output head:** **not tied** to the input embeddings (trained separately)
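
For quick reference, the summary corresponds roughly to the hyperparameters below. The dictionary is purely illustrative; the key names are not the actual config fields used by the repo's custom modeling code (loaded with `trust_remote_code=True`).

```python
# Illustrative hyperparameters only, restating the summary above.
config = {
    "d_model": 1024,                 # hidden size
    "n_layers": 16,                  # decoder blocks
    "n_heads": 32,                   # attention heads (head_dim = 1024 // 32 = 32)
    "positional_encoding": "rotary",
    "activation": "gelu",
    "vocab_size": 65536,             # bvv241-2-3 compatible
    "n_embed": 16,                   # frozen float embedding dim, repeated 64x to 1024
    "tie_word_embeddings": False,    # output head trained separately
}
```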

---

## Tokenizer

The intended tokenizer is **bvv241-2-3** (same vocab size and indexing):

- https://huggingface.co/Bochkov/bvv241-2-3

You may load the tokenizer either from this model repo (if included) or from the standalone tokenizer repo. The key requirement is **exact vocab alignment**.
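
A minimal alignment check, assuming the tokenizer loads through the standard `AutoTokenizer` API:

```python
from transformers import AutoTokenizer

# Load from the standalone tokenizer repo (or from this model repo, if included).
tokenizer = AutoTokenizer.from_pretrained("Bochkov/bvv241-2-3")

# The model expects exactly this vocabulary size and indexing.
assert len(tokenizer) == 65536, "vocab must align with the model's 65,536 entries"
```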

---

## How to use (Transformers)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-16-float-269m")
model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-16-float-269m",
    trust_remote_code=True,
).to(device)

inputs = torch.tensor(
    [tokenizer.encode("Question: What is the capital of Japan?\nAnswer:")],
    dtype=torch.long,
    device=device,
)

outputs = model.generate(
    inputs,
    max_new_tokens=10,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].tolist()))
```

---

## Intended use

Research only, especially for:

- Comparing **Model_16_FLOAT** vs **Model_16_BIT** (effect of continuous normalized vectors vs binary ID)
- Comparing **Model_16_FLOAT** vs **Model_UNI_GLYPH** (effect of glyph-derived structure vs minimal vectors)
- Studying emergent semantics when embeddings are **frozen and non-semantic**

Not intended for production deployment.

---

## Related links

- **Model collection (paper artifacts):**
  https://huggingface.co/collections/Bochkov/emergent-semantics-beyond-token-embeddings
- **UNI_GLYPH main model:**
  https://huggingface.co/Bochkov/emergent-semantics-model-uni-glyph-335m
- **16-bit ablation:**
  https://huggingface.co/Bochkov/emergent-semantics-model-16-bit-269m
- **Tokenizer:**
  https://huggingface.co/Bochkov/bvv241-2-3
- **Code (GitHub):**
  https://github.com/AVBochkov/Embeddings

---

## 🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

```bibtex
@article{bochkov2025emergent,
  title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
  author={Andrey Bochkov},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=Odh8IynO1o},
}

@misc{bochkov2025growingtransformersmodularcomposition,
  title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
  author={A. Bochkov},
  year={2025},
  eprint={2507.07129},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129},
}
```