---
license: apache-2.0
tags:
  - text-generation
  - causal-lm
  - transformer
  - research
  - interpretability
  - multilingual
  - unicode
  - frozen-embeddings
  - ablation
language:
  - multilingual
library_name: transformers
pipeline_tag: text-generation
---

# Emergent Semantics — Model_64_FLOAT (272M)

This repository provides **Model_64_FLOAT (272M)**, an **ablation model** from the papers:

[📚 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations)](https://huggingface.co/papers/2507.04886)

[📚 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate)](https://huggingface.co/papers/2507.07129)

[📚 Blog Article](https://huggingface.co/blog/Bochkov/emergent-semantics-beyond-token-embeddings)

This checkpoint tests whether language modeling and semantic structure can emerge when the **entire input embedding layer is frozen** and contains **no semantic or glyph/visual information**.

Compared to **Model_64_BIT**, this model uses the same embedding dimensionality (`n_embed=64`) and the same “unique per token” construction, but the embedding vectors are **floating-point** (after a deterministic projection/normalization step), rather than raw binary components.

---

## Key idea (what this ablation tests)

- Each token is assigned a **frozen 64-dimensional float vector** (`n_embed=64`).
- The vectors originate from **random per-token patterns** and are constructed to guarantee a **unique ID per token** (**no collisions by design**).
- A deterministic post-processing step (e.g., PCA/projection + normalization) converts the raw patterns into **float embeddings** and standardizes their scale.
- The embedding layer is **frozen** throughout training (`requires_grad = False`).

To match the Transformer hidden size, the 64-dim embedding is expanded to 1024 via a **non-trainable repetition**:
`repeat_interleave(16)`, so `64 * 16 = 1024`.

This keeps the Transformer backbone identical while isolating the role of embedding *trainability* and embedding *content*.
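Below is a minimal sketch of this mechanism, assuming a plain `nn.Embedding` table; the seed, normalization, and names are illustrative, not the exact construction used for this checkpoint:

```python
import torch
import torch.nn as nn

vocab_size, n_embed, d_model = 65_536, 64, 1_024

# Random per-token float patterns, deterministically normalized (illustrative only).
torch.manual_seed(0)
codes = torch.randn(vocab_size, n_embed)
codes = codes / codes.norm(dim=1, keepdim=True)            # standardize scale
assert torch.unique(codes, dim=0).shape[0] == vocab_size    # unique per token, no collisions

# Frozen embedding table: never updated during training.
embed = nn.Embedding(vocab_size, n_embed)
embed.weight.data.copy_(codes)
embed.weight.requires_grad = False

def embed_tokens(token_ids: torch.Tensor) -> torch.Tensor:
    """Look up the 64-dim frozen code and expand to d_model by non-trainable repetition."""
    x = embed(token_ids)                                     # (batch, seq, 64)
    return x.repeat_interleave(d_model // n_embed, dim=-1)   # (batch, seq, 1024)
```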

---

## Important: parameter count difference (vs 335M models)

This checkpoint has **~272M parameters**, while models with a standard `n_embed=1024` embedding table (e.g. **UNI_GLYPH / unfrozen baselines**) are **~335M**.

The reduction is primarily due to the smaller embedding matrix:

- Standard embedding params: `vocab_size * 1024 = 65536 * 1024 ≈ 67.1M`
- This model’s embedding params: `vocab_size * 64 = 65536 * 64 ≈ 4.19M`

So the Transformer backbone is the same, but the **embedding table is much smaller**, lowering total parameter count.
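A quick back-of-the-envelope check of these numbers (illustrative arithmetic, not taken from the repository code):

```python
vocab_size = 65_536

full_table  = vocab_size * 1024   # standard n_embed=1024 embedding table
small_table = vocab_size * 64     # this model's frozen n_embed=64 table

print(f"standard table: {full_table / 1e6:.1f}M params")            # 67.1M
print(f"this model:     {small_table / 1e6:.2f}M params")           # 4.19M
print(f"difference:     {(full_table - small_table) / 1e6:.1f}M")   # ~62.9M, close to 335M - 272M
```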

---

## Model summary

- **Architecture:** decoder-only Transformer (GPT-like)
- **Hidden size (`d_model`):** 1024  
- **Layers:** 16  
- **Heads:** 32  
- **Positional encoding:** rotary embeddings  
- **Activation:** GELU  
- **Tokenizer / vocab size:** 65,536 (bvv241-2-3 compatible)
- **Input embeddings:** **frozen**, **float**, `n_embed=64`, expanded to 1024 by repetition (non-trainable)
- **Embedding initialization:** random per-token patterns → deterministic projection/normalization → float vectors (**unique per token**, no collisions)
- **Output head:** **not tied** to the input embeddings (trained separately)
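A loaded checkpoint can be sanity-checked against this summary; the snippet below is a hedged example (whether `get_input_embeddings()` is implemented depends on the custom model class pulled in via `trust_remote_code`):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-64-float-272m", trust_remote_code=True
)

# Total parameter count should be roughly 272M.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")

# If the custom class exposes the standard accessor, the input embedding table
# should be 65536 x 64 and frozen (requires_grad == False).
try:
    emb = model.get_input_embeddings()
    print(tuple(emb.weight.shape), emb.weight.requires_grad)
except (AttributeError, NotImplementedError):
    print("input-embedding accessor not exposed by this custom model class")
```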

---

## Tokenizer

The intended tokenizer is **bvv241-2-3** (same vocab size and indexing):

- https://huggingface.co/Bochkov/bvv241-2-3

You may load the tokenizer either from this model repo (if included) or from the standalone tokenizer repo. The key requirement is **exact vocab alignment**.
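One way to check alignment, assuming the model repo ships a tokenizer alongside the standalone repo (an illustrative check, not something the model code requires):

```python
from transformers import AutoTokenizer

tok_model = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-64-float-272m")
tok_ref   = AutoTokenizer.from_pretrained("Bochkov/bvv241-2-3")

# The two vocabularies must agree exactly: same size, same token-to-id mapping.
probe = "Question: What is the capital of Japan?"
assert tok_model.vocab_size == tok_ref.vocab_size
assert tok_model.encode(probe) == tok_ref.encode(probe)
print("tokenizers aligned, vocab size:", tok_model.vocab_size)
```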

---

## How to use (Transformers)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-64-float-272m")
model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-64-float-272m",
    trust_remote_code=True,
).to('cuda')

# Encode a prompt and generate greedily (do_sample=False).
inputs = torch.tensor(
    [tokenizer.encode("Question: What is the capital of Japan?\nAnswer:")],
    dtype=torch.long,
    device='cuda',
)

outputs = model.generate(
    inputs,
    max_new_tokens=10,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].tolist()))

# Example output:
# Question: What is the capital of Japan?
# Answer:Japan
#    </s><|
```

---

## Intended use

This model is intended for **research only**, especially for:

- Comparisons vs **Model_UNI_GLYPH (glyph/PCA frozen embeddings)** and vs **trainable-embedding baselines**
- Ablations comparing **binary vs float** frozen identifier embeddings at the same `n_embed`
- Studying whether semantic structure emerges in Transformer blocks when the input embedding space is a **random-but-unique float code**

Not intended for production deployment (no instruction tuning, safety tuning, or factuality guarantees).

---

## Related links

- **Model collection (paper artifacts):**  
  https://huggingface.co/collections/Bochkov/emergent-semantics-beyond-token-embeddings
- **UNI_GLYPH main model (frozen visual glyph embeddings):**  
  https://huggingface.co/Bochkov/emergent-semantics-model-uni-glyph-335m
- **Tokenizer collection:**  
  https://huggingface.co/collections/Bochkov/tokenizers
- **Code (GitHub):**  
  https://github.com/AVBochkov/Embeddings

---

## 🧑‍🔬 Citation & Concept
If you use this model or the underlying concepts in your research, please cite our work:
```bibtex
@article{bochkov2025emergent,
  title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
  author={Andrey Bochkov},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=Odh8IynO1o}
}

@misc{bochkov2025growingtransformersmodularcomposition,
  title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
  author={A. Bochkov},
  year={2025},
  eprint={2507.07129},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129}
}
```