---
license: apache-2.0
tags:
  - text-generation
  - causal-lm
  - transformer
  - research
  - interpretability
  - multilingual
  - unicode
  - frozen-embeddings
  - ablation
language:
  - multilingual
library_name: transformers
pipeline_tag: text-generation
---

# Emergent Semantics — Model_16_FLOAT (269M)

This repository provides **Model_16_FLOAT (269M)**, an **ablation model** from the following papers:

[📚 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations)](https://huggingface.co/papers/2507.04886)

[📚 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate)](https://huggingface.co/papers/2507.07129)

[📚 Blog Article](https://huggingface.co/blog/Bochkov/emergent-semantics-beyond-token-embeddings)

This checkpoint is designed to study the effect of **normalization / PCA-style processing** in a *minimal* frozen embedding setting.

Unlike **Model_UNI_GLYPH**, this model does **not** use glyph-based embeddings. Instead, it uses a **frozen 16-dimensional float embedding** per token.

---

## Key idea (what this ablation tests)

This model isolates the impact of having **float** frozen embeddings (with **PCA + normalization**) versus the strictly **binary token-ID** variant (**Model_16_BIT**):

- **`n_embed = 16`** per token (**float components**, not binary)
- Embedding vectors are **precomputed** (PCA + L2 normalization) and then **frozen**
- The embedding layer is never updated (`requires_grad=False`)
- To match the Transformer hidden size (`d_model = 1024`), each 16-dim embedding is expanded to 1024 via a **non-trainable repetition**:
  `repeat_interleave(64)` → `16 * 64 = 1024` (see the sketch after this list)

This lets you test whether the model’s behavior changes when the frozen token “identifier” is:
- discrete + purely ID-like (**16-bit**), vs
- continuous + normalized (**16-float**)
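
A minimal PyTorch sketch of this setup (class and variable names are illustrative, and the random table below stands in for the actual precomputed PCA vectors; only the shapes, the freezing, and the `repeat_interleave` expansion mirror the description above):

```python
import torch
import torch.nn as nn

class FrozenFloatEmbedding(nn.Module):
    """Illustrative frozen 16-dim float embedding, expanded to d_model by repetition."""

    def __init__(self, precomputed: torch.Tensor, d_model: int = 1024):
        super().__init__()
        vocab_size, n_embed = precomputed.shape        # e.g. (65536, 16)
        assert d_model % n_embed == 0
        self.repeats = d_model // n_embed              # 1024 // 16 = 64
        self.table = nn.Embedding(vocab_size, n_embed)
        self.table.weight.data.copy_(precomputed)
        self.table.weight.requires_grad = False        # never updated during training

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.table(token_ids)                         # (..., 16)
        return x.repeat_interleave(self.repeats, dim=-1)  # (..., 1024)

# Stand-in for the precomputed PCA + L2-normalized vectors:
table = torch.randn(65536, 16)
table = table / table.norm(dim=-1, keepdim=True)

emb = FrozenFloatEmbedding(table)
print(emb(torch.tensor([[1, 2, 3]])).shape)  # torch.Size([1, 3, 1024])
```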

---

## Important: parameter count difference (vs 335M models)

This checkpoint has **~269M parameters**, while models with a standard `n_embed=1024` embedding table (e.g. **UNI_GLYPH / unfrozen baselines**) are **~335M**.

This difference is expected and comes primarily from the embedding matrix size:

- Standard embedding params: `vocab_size * 1024 = 65536 * 1024 ≈ 67.1M`
- This model’s embedding params: `vocab_size * 16 = 65536 * 16 ≈ 1.0M`

So the **Transformer backbone is the same** (layers/heads/d_model), but the embedding table is much smaller, reducing total parameters.
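
The arithmetic, spelled out (back-of-the-envelope only):

```python
vocab_size, d_model, n_embed = 65536, 1024, 16

standard_table = vocab_size * d_model  # 67,108,864 ≈ 67.1M parameters
this_table = vocab_size * n_embed      #  1,048,576 ≈  1.0M parameters

# ≈ 66.1M saved, consistent with the ~335M vs ~269M totals
print(f"difference ≈ {(standard_table - this_table) / 1e6:.1f}M")
```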

---

## Model summary

- **Architecture:** decoder-only Transformer (GPT-like)
- **Hidden size (`d_model`):** 1024  
- **Layers:** 16  
- **Heads:** 32  
- **Positional encoding:** rotary embeddings  
- **Activation:** GELU  
- **Tokenizer / vocab size:** 65,536 (bvv241-2-3 compatible)
- **Input embeddings:** **frozen**, `n_embed=16` (**float**, PCA + L2 normalized), expanded to 1024 by repetition (non-trainable)
- **Output head:** **not tied** to the input embeddings (trained separately)
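
For convenience, the same hyperparameters gathered as a plain dict (hypothetical key names; the actual `config.json` in the repo may differ):

```python
# Hypothetical summary of the hyperparameters above, not the repo's real config keys.
model_summary = {
    "d_model": 1024,
    "n_layers": 16,
    "n_heads": 32,
    "vocab_size": 65536,
    "n_embed": 16,                 # frozen float embedding width
    "expansion_repeats": 64,       # 16 * 64 = 1024
    "positional_encoding": "rotary",
    "activation": "gelu",
    "tie_word_embeddings": False,  # output head trained separately
}
```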

---

## Tokenizer

The intended tokenizer is **bvv241-2-3** (same vocab size and indexing):

- https://huggingface.co/Bochkov/bvv241-2-3

You may load the tokenizer either from this model repo (if included) or from the standalone tokenizer repo. The key requirement is **exact vocab alignment**.
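
A quick way to sanity-check that alignment (assuming both repos expose a tokenizer loadable via `AutoTokenizer`, and that `len(tokenizer)` reflects the full 65,536-entry vocab):

```python
from transformers import AutoTokenizer

tok_model = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-16-float-269m")
tok_standalone = AutoTokenizer.from_pretrained("Bochkov/bvv241-2-3")

# Same size and identical token-to-id mapping are both required.
assert len(tok_model) == len(tok_standalone) == 65536
sample = "Question: What is the capital of Japan?"
assert tok_model.encode(sample) == tok_standalone.encode(sample)
```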

---

## How to use (Transformers)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-16-float-269m")
model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-16-float-269m",
    trust_remote_code=True,
).to(device)

inputs = torch.tensor(
    [tokenizer.encode("Question: What is the capital of Japan?\nAnswer:")],
    dtype=torch.long,
    device=device,
)

outputs = model.generate(inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(outputs[0].tolist()))

# Question: What is the capital of Japan?
# Answer:A temperature in
```

---

## Intended use

Research only, especially for:

- Comparing **Model_16_FLOAT** vs **Model_16_BIT** (effect of continuous normalized vectors vs binary ID)
- Comparing **Model_16_FLOAT** vs **Model_UNI_GLYPH** (effect of glyph-derived structure vs minimal vectors)
- Studying emergent semantics when embeddings are **frozen and non-semantic**

Not intended for production deployment.
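
For the comparisons above, a simple (hypothetical) harness is to score the same text under each checkpoint, assuming the remote code follows the standard causal-LM interface (accepts `labels` and returns a mean cross-entropy `loss`):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoints = [
    "Bochkov/emergent-semantics-model-16-float-269m",   # this model
    "Bochkov/emergent-semantics-model-16-bit-269m",     # binary ablation
    "Bochkov/emergent-semantics-model-uni-glyph-335m",  # glyph-based main model
]

text = "Question: What is the capital of Japan?\nAnswer:"
for repo in checkpoints:
    tok = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True).eval()
    ids = torch.tensor([tok.encode(text)])
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean next-token cross-entropy
    print(f"{repo}: loss = {loss.item():.3f}")
```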

---

## Related links

- **Model collection (paper artifacts):**  
  https://huggingface.co/collections/Bochkov/emergent-semantics-beyond-token-embeddings
- **UNI_GLYPH main model:**  
  https://huggingface.co/Bochkov/emergent-semantics-model-uni-glyph-335m
- **16-bit ablation:**  
  https://huggingface.co/Bochkov/emergent-semantics-model-16-bit-269m
- **Tokenizer:**  
  https://huggingface.co/Bochkov/bvv241-2-3
- **Code (GitHub):**  
  https://github.com/AVBochkov/Embeddings

---

## 🧑‍🔬 Citation & Concept
If you use this model or the underlying concepts in your research, please cite our work:
```bibtex
@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o}
}
@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}
}
```