---
license: apache-2.0
tags:
- text-generation
- causal-lm
- transformer
- research
- interpretability
- multilingual
- unicode
- frozen-embeddings
- ablation
language:
- multilingual
library_name: transformers
pipeline_tag: text-generation
---
# Emergent Semantics — Model_256_FLOAT (285M)
This repository provides **Model_256_FLOAT (285M)**, an **ablation model** from the following work:
- [📚 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations)](https://huggingface.co/papers/2507.04886)
- [📚 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate)](https://huggingface.co/papers/2507.07129)
- [📚 Blog Article](https://huggingface.co/blog/Bochkov/emergent-semantics-beyond-token-embeddings)

This checkpoint isolates the effect of **floating-point / normalized frozen embeddings** (and the geometry they induce), while still keeping the embeddings **non-trainable** and **non-semantic**.
---
## Key idea (what this ablation tests)
This model is a close counterpart to **Model_256_BIT**, but the embedding vectors are **floats** rather than **binary**.
Pipeline (high-level):
1. Assign each token a **random unique code** (collision-free “unique ID per token” guaranteed by construction).
2. Convert the code into a vector representation.
3. Apply **PCA projection** to obtain a compact `n_embed = 256` representation.
4. Apply **L2 normalization** (so each token embedding has unit norm).
5. Freeze the embedding table (`requires_grad=False`) during training.
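The five steps above can be sketched at toy scale as follows. This is a minimal illustration, not the released training code: the function name, the bit-expansion used for step 2, the SVD-based PCA, and the small sizes (`vocab_size=64`, `code_bits=12`, `n_embed=8`) are all assumptions made for the example.

```python
import numpy as np

def build_frozen_embeddings(vocab_size=64, code_bits=12, n_embed=8, seed=0):
    """Toy sketch: random unique codes -> bit vectors -> PCA -> unit-norm rows."""
    if 2**code_bits < vocab_size or n_embed > min(vocab_size, code_bits):
        raise ValueError("need 2**code_bits >= vocab_size and n_embed <= min(vocab_size, code_bits)")
    rng = np.random.default_rng(seed)

    # Step 1: distinct integer codes; slicing a permutation cannot produce collisions.
    codes = rng.permutation(2**code_bits)[:vocab_size]

    # Step 2 (assumed encoding): expand each code into its binary digits.
    bits = ((codes[:, None] >> np.arange(code_bits)) & 1).astype(np.float64)

    # Step 3: PCA via SVD of the centered matrix, keeping the top n_embed components.
    centered = bits - bits.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_embed].T

    # Step 4: L2-normalize each row so every token embedding has unit norm.
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    emb = projected / np.maximum(norms, 1e-12)

    # Step 5 (not run here): in PyTorch the table could be frozen with
    # torch.nn.Embedding.from_pretrained(torch.from_numpy(emb), freeze=True).
    return codes, emb

codes, emb = build_frozen_embeddings()
print(emb.shape)
```
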
So **Model_256_FLOAT** tests whether improvements/convergence differences come from:
- simply having a stable token identifier (random, frozen), **or**
- additionally having a *continuous normalized geometry* (float values + normalization), even without any semantic or glyph information.
To match the Transformer hidden size, the 256-dim embedding is expanded to 1024 via a **non-trainable repetition**:
`repeat_interleave(4)` → `256 * 4 = 1024`.
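
As a minimal sketch of the repetition step, the snippet below uses NumPy's `np.repeat`, which repeats each element consecutively in the same way as `torch.repeat_interleave`. The toy input and variable names are illustrative and not taken from the released code. One consequence worth noting: repeating every coordinate 4 times scales the L2 norm by a factor of 2, so a unit-norm 256-dim vector becomes a norm-2 vector of 1024 dims unless it is rescaled afterwards (whether the released model rescales is not stated here).

```python
import numpy as np

n_embed, repeat = 256, 4

# Toy unit-norm embeddings for 2 tokens (illustrative values only).
rng = np.random.default_rng(0)
emb = rng.standard_normal((2, n_embed))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Each coordinate is repeated 4 times in place: [a, b] -> [a, a, a, a, b, b, b, b].
expanded = np.repeat(emb, repeat, axis=-1)

print(expanded.shape)                    # (2, 1024)
print(np.linalg.norm(expanded, axis=1))  # each norm is 2.0 up to rounding
```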
---
## Important: parameter count difference (vs 335M models)
This checkpoint has **~285M parameters**, while models with a standard `n_embed=1024` embedding table (e.g. **UNI_GLYPH / unfrozen baselines**) are **~335M**.
The difference is primarily the embedding table size:
- Standard embedding params: `vocab_size * 1024 = 65536 * 1024 ≈ 67.1M`
- This model’s embedding params: `vocab_size * 256 = 65536 * 256 ≈ 16.8M`
The Transformer backbone is the same (layers/heads/d_model), but the total parameter count is lower because the embedding matrix is smaller.
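
The arithmetic can be checked directly. The resulting difference of roughly 50.3M parameters accounts for the gap between the ~335M and ~285M totals; the variable names below are illustrative.

```python
vocab_size = 65_536

standard_embed = vocab_size * 1024  # embedding table with n_embed = 1024
small_embed = vocab_size * 256      # embedding table with n_embed = 256 (this model)
saved = standard_embed - small_embed

print(f"{standard_embed:,}")  # 67,108,864
print(f"{small_embed:,}")     # 16,777,216
print(f"{saved:,}")           # 50,331,648
```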
---
## Model summary
- **Architecture:** decoder-only Transformer (GPT-like)
- **Hidden size (`d_model`):** 1024
- **Layers:** 16
- **Heads:** 32
- **Positional encoding:** rotary embeddings
- **Activation:** GELU
- **Tokenizer / vocab size:** 65,536 (bvv241-2-3 compatible)
- **Input embeddings:** **frozen**, float, `n_embed=256`, derived from random unique IDs + **PCA + L2 normalization**, expanded to 1024 by repetition (non-trainable)
- **Output head:** **not tied** to the input embeddings (trained separately)
---
## Tokenizer
The intended tokenizer is **bvv241-2-3** (same vocab size and indexing):
- https://huggingface.co/Bochkov/bvv241-2-3
You may load the tokenizer either from this model repo (if included) or from the standalone tokenizer repo. The key requirement is **exact vocab alignment**.
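
One way to verify vocab alignment is to compare the token-to-id mappings of the two tokenizers. The helper below is a hypothetical sketch (its name is not part of any library); in practice its inputs could be the dictionaries returned by each tokenizer's `get_vocab()` method, replaced here by toy dictionaries so the example runs offline.

```python
def vocab_mismatches(vocab_a, vocab_b):
    """Return tokens whose ids differ, or that appear in only one vocabulary."""
    tokens = set(vocab_a) | set(vocab_b)
    return sorted(t for t in tokens if vocab_a.get(t) != vocab_b.get(t))

# Toy vocabularies standing in for tokenizer.get_vocab() results.
repo_vocab = {"a": 0, "b": 1, "c": 2}
standalone_vocab = {"a": 0, "b": 2, "d": 3}

# An empty list would mean the two vocabularies are exactly aligned.
print(vocab_mismatches(repo_vocab, standalone_vocab))  # ['b', 'c', 'd']
```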
---
## How to use (Transformers)
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Bochkov/emergent-semantics-model-256-float-285m"
# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
# trust_remote_code=True lets Transformers run the model code shipped in the repo.
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True).to(device)
model.eval()

prompt = "Question: What is the capital of Japan?\nAnswer:"
inputs = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long, device=device)

# Greedy decoding (do_sample=False) gives a deterministic continuation.
outputs = model.generate(
    inputs,
    max_new_tokens=10,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].tolist()))
# Question: What is the capital of Japan?
# Answer:San Juan
```
---
## Intended use
This model is intended for **research only**, especially for:
- Comparing **binary vs float normalized** frozen embeddings under the same `n_embed`
- Studying whether **normalization / continuous geometry** affects convergence and reasoning benchmarks
- Controlled comparisons vs:
  - **Model_256_BIT**
  - **Model_UNI_GLYPH**
  - trainable-embedding baselines
Not intended for production deployment.
---
## Related links
- **Model collection (paper artifacts):**
https://huggingface.co/collections/Bochkov/emergent-semantics-beyond-token-embeddings
- **UNI_GLYPH main model (frozen visual glyph embeddings):**
https://huggingface.co/Bochkov/emergent-semantics-model-uni-glyph-335m
- **Tokenizer:**
https://huggingface.co/Bochkov/bvv241-2-3
- **Code (GitHub):**
https://github.com/AVBochkov/Embeddings
---
## 🧑‍🔬 Citation & Concept
If you use this model or the underlying concepts in your research, please cite our work:
```bibtex
@article{bochkov2025emergent,
title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
author={Andrey Bochkov},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2025},
url={https://openreview.net/forum?id=Odh8IynO1o},
note={}
}
@misc{bochkov2025growingtransformersmodularcomposition,
title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
author={A. Bochkov},
year={2025},
eprint={2507.07129},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2507.07129},
}
```