Bochkov committed
Commit 586c93d · verified · 1 Parent(s): ce8fde1

Update README.md

Files changed (1)
  1. README.md +14 -24
README.md CHANGED
@@ -16,30 +16,31 @@ library_name: transformers
  pipeline_tag: text-generation
  ---
 
- # Emergent Semantics — Model_64_BIT (272M)
+ # Emergent Semantics — Model_64_FLOAT (272M)
 
- This repository provides **Model_64_BIT (272M)** — an **ablation model** from the paper:
+ This repository provides **Model_64_FLOAT (272M)** — an **ablation model** from the papers:
 
  [📚 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations)](https://huggingface.co/papers/2507.04886)
 
  [📚 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate)](https://huggingface.co/papers/2507.07129)
 
- This checkpoint is designed to test whether a Transformer can learn robust language behavior when the **entire input embedding layer is frozen** and contains **no semantic or visual signal**.
+ This checkpoint tests whether language modeling and semantic structure can emerge when the **entire input embedding layer is frozen** and contains **no semantic or glyph/visual information**.
 
- Compared to **Model_16_BIT**, this model uses a larger frozen binary code (`n_embed=64`), but the codes are **randomly generated** rather than encoding the token index directly.
+ Compared to **Model_64_BIT**, this model uses the same embedding dimensionality (`n_embed=64`) and the same “unique per token” construction, but the embedding vectors are **floating-point** (after a deterministic projection/normalization step) rather than raw binary components.
 
  ---
 
  ## Key idea (what this ablation tests)
 
- - Each token is assigned a **frozen 64-dimensional binary vector** (`n_embed=64`).
- - These vectors are **randomly generated**, but constructed to guarantee a **unique ID per token** (**no collisions by design**).
+ - Each token is assigned a **frozen 64-dimensional float vector** (`n_embed=64`).
+ - The vectors originate from **random per-token patterns** and are constructed to guarantee a **unique ID per token** (**no collisions by design**).
+ - A deterministic post-processing step (e.g., PCA/projection + normalization) converts the raw patterns into **float embeddings** and standardizes their scale.
  - The embedding layer is **frozen** throughout training (`requires_grad = False`).
 
  To match the Transformer hidden size, the 64-dim embedding is expanded to 1024 via a **non-trainable repetition**:
  `repeat_interleave(16)` → `64 * 16 = 1024`.
 
- This makes the input compatible with the same `d_model=1024` Transformer backbone while ensuring the embedding table itself is purely a fixed identifier space.
+ This keeps the Transformer backbone identical while isolating the role of embedding *trainability* and embedding *content*.
 
  ---
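To make the expansion step in this hunk concrete, here is a minimal PyTorch sketch of a frozen embedding table widened by `repeat_interleave` (illustrative only — the random `codes` tensor and all names are stand-ins, not the repository's actual values):

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 65_536  # bvv241-2-3 vocabulary size
N_EMBED = 64         # width of the frozen per-token code
D_MODEL = 1024       # Transformer hidden size

# Stand-in for the repo's fixed 64-dim float codes (one row per token).
codes = torch.randn(VOCAB_SIZE, N_EMBED)

# Frozen table: the weights keep requires_grad = False for the whole run.
emb = nn.Embedding.from_pretrained(codes, freeze=True)

token_ids = torch.tensor([[17, 42, 999]])
x = emb(token_ids)                                   # (1, 3, 64)
x = x.repeat_interleave(D_MODEL // N_EMBED, dim=-1)  # (1, 3, 1024), non-trainable expansion
assert x.shape == (1, 3, D_MODEL)
```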
@@ -65,23 +66,12 @@ So the Transformer backbone is the same, but the **embedding table is much smaller**
  - **Positional encoding:** rotary embeddings
  - **Activation:** GELU
  - **Tokenizer / vocab size:** 65,536 (bvv241-2-3 compatible)
- - **Input embeddings:** **frozen**, **binary**, `n_embed=64`, expanded to 1024 by repetition (non-trainable)
- - **Embedding initialization:** random binary codes with **unique per-token assignment (no collisions)**
+ - **Input embeddings:** **frozen**, **float**, `n_embed=64`, expanded to 1024 by repetition (non-trainable)
+ - **Embedding initialization:** random per-token patterns → deterministic projection/normalization → float vectors (**unique per token**, no collisions)
  - **Output head:** **not tied** to the input embeddings (trained separately)
 
  ---
 
- ## Files in this repo (embedding reference)
-
- For transparency and reproducibility, the explicit frozen embedding values are included in this repository.
-
- - `embeddings.txt` (human-readable reference; token → 64-bit vector):
-   `https://huggingface.co/Bochkov/emergent-semantics-model-64-float-272m/blob/main/embeddings.txt`
-
- > Note: Embeddings are shipped in this model repo (even though the tokenizer exists separately) to keep the model+embedding mapping self-contained and unambiguous.
-
- ---
-
  ## Tokenizer
 
  The intended tokenizer is **bvv241-2-3** (same vocab size and indexing):
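The "random patterns → unique codes → deterministic float post-processing" recipe from the architecture list above could be realized as in the sketch below. This is an assumption-labeled illustration: the seed, the resampling loop, and the centering/L2 normalization stand in for the repo's actual construction, which the diff does not spell out.

```python
import torch

VOCAB_SIZE = 65_536
N_EMBED = 64

g = torch.Generator().manual_seed(0)  # fixed seed -> reproducible table

# Random binary patterns, one 64-bit row per token.
codes = torch.randint(0, 2, (VOCAB_SIZE, N_EMBED), generator=g)

# Redraw duplicated rows until every pattern is unique ("no collisions
# by design"); with 2^64 possible patterns this loop almost never fires,
# but it turns near-certain uniqueness into a guarantee.
while codes.unique(dim=0).shape[0] < VOCAB_SIZE:
    _, inverse, counts = torch.unique(codes, dim=0, return_inverse=True, return_counts=True)
    dup = counts[inverse] > 1
    codes[dup] = torch.randint(0, 2, (int(dup.sum()), N_EMBED), generator=g)

# Deterministic float post-processing (a stand-in for the projection /
# normalization step): center each row and scale it to unit norm.
float_codes = codes.float()
float_codes -= float_codes.mean(dim=1, keepdim=True)
float_codes /= float_codes.norm(dim=1, keepdim=True).clamp_min(1e-8)
```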
@@ -99,8 +89,8 @@ You may load the tokenizer either from this model repo (if included) or from the
  import torch
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
- tokenizer = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-64-bit-272m")
- model = AutoModelForCausalLM.from_pretrained("Bochkov/emergent-semantics-model-64-bit-272m", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-64-float-272m")
+ model = AutoModelForCausalLM.from_pretrained("Bochkov/emergent-semantics-model-64-float-272m", trust_remote_code=True)
 
  inputs = torch.tensor([tokenizer.encode("Question: What is the capital of Japan?\nAnswer:")], dtype=torch.long, device='cuda')
 
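The hunks shown here omit the generation call that sits between `inputs` and the final `print` in the README; for completeness, a minimal end-to-end sketch using the standard `transformers` `generate` API (the decoding parameters below are assumptions, not the README's original values):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-64-float-272m")
model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-64-float-272m",
    trust_remote_code=True,
).to('cuda')

inputs = torch.tensor(
    [tokenizer.encode("Question: What is the capital of Japan?\nAnswer:")],
    dtype=torch.long, device='cuda',
)

# Assumed decoding settings; greedy decoding keeps the example deterministic.
with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=32, do_sample=False)

print(tokenizer.decode(outputs[0].tolist()))
```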
@@ -120,8 +110,8 @@ print(tokenizer.decode(outputs[0].tolist()))
  This model is intended for **research only**, especially for:
 
  - Comparisons vs **Model_UNI_GLYPH (glyph/PCA frozen embeddings)** and vs **trainable-embedding baselines**
- - Studying whether semantic structure emerges in Transformer blocks when the input embedding space is a **random-but-unique identifier code**
- - Ablations on embedding dimensionality (`n_embed`) while keeping the Transformer backbone fixed
+ - Ablations comparing **binary vs float** frozen identifier embeddings at the same `n_embed`
+ - Studying whether semantic structure emerges in Transformer blocks when the input embedding space is a **random-but-unique float code**
 
  Not intended for production deployment (no instruction tuning, safety tuning, or factuality guarantees).
 
117