Emergent Semantics – Model_16_BIT (269M)

This repository provides Model_16_BIT (269M), an ablation model from the following papers:

📚 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations)

📚 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate)

📚 Blog Article

The purpose of this checkpoint is to isolate whether trainable/semantic input embeddings are necessary at all.

Unlike Model_UNI_GLYPH, this model uses a frozen 16-dimensional binary embedding that contains no glyph or visual information: it effectively encodes only the token identity.


Key idea (what this ablation tests)

  • Vocabulary size is 65,536, so a token index fits exactly into 16 bits.
  • Each token is mapped to a fixed 16-bit vector (binary components), i.e. a deterministic token-ID representation.
  • This embedding layer is frozen for the entire training run.

To match the Transformer hidden size, the 16-dim embedding is expanded to 1024 via a non-trainable repetition: repeat_interleave(64), i.e. 16 * 64 = 1024.

So the model receives a full d_model=1024 input vector, but the learned semantic geometry does not come from the embedding table.
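A minimal sketch of how such a fixed token-ID embedding and its expansion can be realized. The LSB-first bit order used here is an illustrative assumption (it matches the example output shown later in this card); the verification snippet further down prints the actual values stored in the checkpoint.

import torch

VOCAB_SIZE = 65536          # 2**16, so a token id fits in exactly 16 bits
N_EMBED = 16
D_MODEL = 1024
SCALE = D_MODEL // N_EMBED  # 64

def binary_embedding(token_id: int) -> torch.Tensor:
    # 16-bit binary decomposition of the token id (LSB-first; assumption).
    bits = [(token_id >> i) & 1 for i in range(N_EMBED)]
    return torch.tensor(bits, dtype=torch.float32)

e16 = binary_embedding(65)            # token id for 'A'
e1024 = e16.repeat_interleave(SCALE)  # non-trainable expansion to d_model
print(e16.tolist())                   # [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(tuple(e1024.shape))             # (1024,)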


Important: parameter count difference (vs 335M models)

This checkpoint has ~269M parameters, while models with a standard n_embed=1024 embedding table (e.g. UNI_GLYPH / unfrozen baselines) are ~335M.

This difference is expected and comes almost entirely from the embedding matrix size:

  • Standard embedding params: vocab_size * 1024 = 65536 * 1024 ≈ 67.1M
  • This model's embedding params: vocab_size * 16 = 65536 * 16 ≈ 1.0M

So the model is architecturally identical in the Transformer backbone (layers/heads/d_model), but has a much smaller frozen embedding table, which reduces total parameter count.
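A quick back-of-the-envelope check in plain Python (the ~66M gap is approximate, since the reported 269M and 335M totals are rounded):

vocab_size = 65536

standard_embed = vocab_size * 1024    # trainable 1024-dim table
frozen_embed_16 = vocab_size * 16     # this model's frozen 16-dim table

print(standard_embed)                       # 67108864  (~67.1M)
print(frozen_embed_16)                      #  1048576  (~1.0M)
print(standard_embed - frozen_embed_16)     # 66060288  (~66M), consistent with 335M - 269M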


Model summary

  • Architecture: decoder-only Transformer (GPT-like)
  • Hidden size (d_model): 1024
  • Layers: 16
  • Heads: 32
  • Positional encoding: rotary embeddings
  • Activation: GELU
  • Tokenizer / vocab size: 65,536 (bvv241-2-3 compatible)
  • Input embeddings: frozen, binary, n_embed=16, expanded to 1024 by repetition (non-trainable)
  • Output head: not tied to the input embeddings (trained separately)
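Because the output head is trained separately from the frozen 16-dim input table, the two cannot share weights; their shapes already differ. A minimal sketch of checking this (the lm_head attribute name is an assumption about the custom model class; token_embeddings is the name used in the verification snippet below):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-16-bit-269m", trust_remote_code=True
)

emb = model.token_embeddings.weight   # frozen input table, expected shape (65536, 16)
head = model.lm_head.weight           # output projection, expected shape (65536, 1024) -- attribute name assumed

print(tuple(emb.shape), tuple(head.shape))
# requires_grad may or may not be False after loading, depending on the custom code
print("input table requires_grad:", emb.requires_grad)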

Files in this repo (embedding reference)

To make this ablation fully inspectable and reproducible, the explicit frozen embedding values are shipped in this repository.

This is convenient for verifying the exact token→vector mapping for each token ID.

Note: Embeddings are shipped in this model repo (even though the tokenizer exists as a separate HF repo) to keep the model+embedding mapping self-contained and unambiguous.
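To locate the shipped embedding reference without guessing file names, huggingface_hub can list the repository contents (a sketch; which file actually holds the embedding values is not specified here, so inspect the listing yourself):

from huggingface_hub import list_repo_files

repo_id = "Bochkov/emergent-semantics-model-16-bit-269m"

# Print all files in the model repo and eyeball which artifact carries the frozen embeddings.
for f in sorted(list_repo_files(repo_id)):
    print(f)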


Tokenizer

The intended tokenizer is bvv241-2-3 (same vocab size and indexing).

You may load the tokenizer either from this model repo (if included) or from the standalone tokenizer repo. The key requirement is exact vocab alignment.
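A quick way to confirm vocab alignment between whichever tokenizer copy you load and the model config (a sketch; it only compares sizes, not individual token-to-id assignments):

from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Bochkov/emergent-semantics-model-16-bit-269m"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

assert tokenizer.vocab_size == model.config.vocab_size == 65536, "vocab mismatch"
print("vocab aligned:", tokenizer.vocab_size)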


How to use (Transformers)


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Custom model code requires trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-16-bit-269m")
model = AutoModelForCausalLM.from_pretrained("Bochkov/emergent-semantics-model-16-bit-269m", trust_remote_code=True).to('cuda')

# Encode the prompt as a batch of one sequence
inputs = torch.tensor([tokenizer.encode("Question: What is the capital of Japan?\nAnswer:")], dtype=torch.long, device='cuda')

# Greedy decoding of 10 new tokens
outputs = model.generate(
    inputs,
    max_new_tokens=10,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))

# Question: What is the capital of Japan?
# Answer:Nagano Prefecture

Verify the 16-bit frozen binary embeddings (sanity check)

The model uses a frozen nn.Embedding(vocab_size=65536, n_embed=16) whose values are strictly binary (0/1). Each 16-dim vector is then deterministically expanded to d_model=1024 via repeat_interleave(scale=64).

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Bochkov/emergent-semantics-model-16-bit-269m"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

print("vocab_size:", tokenizer.vocab_size)
print("config:", {k: getattr(model.config, k) for k in ["vocab_size", "n_embed", "d_model", "n_layer", "n_head", "scale"]})

# --- 1) Show embedding matrix shape (should be 65536 x 16) ---
W = model.token_embeddings.weight.detach().cpu()
print("token_embeddings.weight shape:", tuple(W.shape))  # (65536, 16)

# --- 2) Tokenize 'A' and show its token id (should be 65 for a unicode-char tokenizer) ---
text = "A"
ids = tokenizer.encode(text, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(ids)

print(f"text={text!r}")
print("ids:", ids)
print("tokens:", tokens)

tid = ids[0]

# --- 3) Print the 16-dim vector and verify it is binary (0/1) ---
e16 = W[tid]  # shape: (16,)
print("16-dim embedding for token id", tid, ":", e16.tolist())

uniq = torch.unique(e16)
print("unique values in e16:", uniq.tolist())

is_binary = torch.all((e16 == 0) | (e16 == 1)).item()
print("is strictly binary (0/1):", is_binary)

# --- 4) Show deterministic expansion to d_model=1024 via repeat_interleave ---
scale = model.config.scale  # should be 1024 // 16 = 64
e1024 = e16.repeat_interleave(scale)  # shape: (1024,)
print("expanded embedding shape:", tuple(e1024.shape))
print("expanded embedding first 128 values:", e1024[:128].tolist())

# --- 5) Global check: all embedding weights are exactly 0/1 ---
is_binary_global = torch.all((W == 0) | (W == 1)).item()
num_non_binary = torch.numel(W) - torch.sum((W == 0) | (W == 1)).item()
print("is binary globally (0/1):", is_binary_global)
print("non-binary entries:", int(num_non_binary))

Expected output highlights (example):

  • vocab_size: 65536
  • config: {'vocab_size': 65536, 'n_embed': 16, 'd_model': 1024, 'n_layer': 16, 'n_head': 32, 'scale': 64}
  • token_embeddings.weight shape: (65536, 16)
  • text='A'
  • ids: [65]
  • tokens: ['A']
  • 16-dim embedding for token id 65 : [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  • unique values in e16: [0.0, 1.0]
  • is strictly binary (0/1): True
  • expanded embedding shape: (1024,)
  • expanded embedding first 128 values: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  • is binary globally (0/1): True
  • non-binary entries: 0
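The 16-dim vector for id 65 looks like the binary decomposition of the id itself, least-significant bit first (65 = 0b1000001, so bits 0 and 6 are set). The sketch below tests that interpretation against the actual weights for a few ids; treat the bit order as an inferred assumption rather than a documented guarantee.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-16-bit-269m", trust_remote_code=True
)
W = model.token_embeddings.weight.detach().cpu()

def lsb_first_bits(token_id: int, n_bits: int = 16) -> torch.Tensor:
    # LSB-first 16-bit decomposition of the token id (assumed convention).
    return torch.tensor([(token_id >> i) & 1 for i in range(n_bits)], dtype=W.dtype)

for tid in [0, 65, 255, 65535]:
    print(tid, "matches LSB-first binary id:", torch.equal(W[tid], lsb_first_bits(tid)))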

Intended use

This model is intended for research only, especially for:

  • Controlled comparisons vs Model_UNI_GLYPH (glyph/PCA frozen embeddings) and vs trainable-embedding baselines
  • Studying whether semantic structure emerges in Transformer blocks even when the input embedding layer is purely structural / token-ID-like
  • Interpretable ablation experiments on embedding initialization and freezing

Not intended for production deployment (no instruction tuning, safety tuning, or factuality guarantees).


Related links


πŸ§‘β€πŸ”¬ Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o}
}
@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}