Growing Transformers — Model 16_bit 1_9 (181M)

This repository contains growing-transformers-model-16-bit-1-9-181m, an ablation model from the following papers:

📚 Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate - https://arxiv.org/abs/2507.07129

📚 Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations - https://openreview.net/forum?id=Odh8IynO1o

It is part of the comparative-study collection:
https://huggingface.co/collections/Bochkov/growing-transformers-layer-wise-expansion-comparative-study

Code:
https://github.com/AVBochkov/PGT


What this model is (in one paragraph)

This is a 9-layer decoder-only Transformer trained with a constructive, layer-wise growth procedure (layers are added and trained progressively while previously trained layers are frozen). Unlike standard models, it uses a fully frozen, extremely low-dimensional token embedding: each token is represented by a 16-dimensional binary vector (n_embed = 16) derived from the token ID. Because vocab_size = 65,536, the full token ID fits exactly into 16 bits, hence the name “16-bit embedding”.

This model is designed to test how far emergent semantics can arise in deeper Transformer layers even when the input embedding is minimal and non-semantic.


Relationship to the UNICODE constructive model (main comparison)

This model is an ablation intended to be compared to:

  • Bochkov/growing-transformers-model-unicode-1-9-247m (constructive growth + frozen “visual UNICODE” embedding)

Both share the same Transformer stack architecture in the controlled study (same number of layers and same d_model / n_head), but differ in the embedding substrate.

Important: parameter count difference (why 181M vs 247M)

The total parameter count is smaller (≈181.6M) than the UNICODE / standard-embedding variants (≈247.6M) because the embedding layer is much smaller here (n_embed=16 instead of a full-size learned/frozen embedding at d_model). This reduces the embedding-matrix parameters substantially and therefore reduces the overall model size.
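
As a quick back-of-the-envelope check (pure arithmetic, consistent with the parameter counts above):

vocab_size, d_model, n_embed = 65_536, 1024, 16
full_embedding   = vocab_size * d_model   # 67,108,864 params for a d_model-sized embedding table
binary_embedding = vocab_size * n_embed   #  1,048,576 params for the 16-bit table
print(f"difference: {(full_embedding - binary_embedding) / 1e6:.1f}M")  # ~66.1M ≈ 247.6M - 181.6M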


Embedding definition (16-bit / n_embed=16)

  • vocab_size = 65,536
  • Each token ID id ∈ [0, 65535] is represented as a 16-bit binary vector (0/1 components).
  • This 16-dim vector is then expanded to the model hidden size (d_model = 1024) by simple repetition (repeat_interleave, as described in the paper); see the sketch after this list.
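
A minimal sketch of the token → vector mapping in PyTorch, assuming least-significant-bit-first bit order (this matches the sample vector for token id 65 in the sanity check below, but verify against the released embedding artifacts before relying on it):

import torch

def sixteen_bit_embedding(token_id: int, n_embed: int = 16, d_model: int = 1024) -> torch.Tensor:
    # Binary expansion of the token id, least-significant bit first.
    bits = [(token_id >> i) & 1 for i in range(n_embed)]
    e16 = torch.tensor(bits, dtype=torch.float32)     # shape: (16,)
    # Deterministic expansion to the hidden size: each bit repeated 64x.
    return e16.repeat_interleave(d_model // n_embed)  # shape: (1024,)

print(sixteen_bit_embedding(65)[:8])  # token 'A' (id 65 = 0b1000001): starts with a run of ones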

Reference file with full embedding vectors

For convenience and reproducibility, the full per-token embedding values are provided as an artifact in this repository.

(Useful if you want to verify the exact token→vector mapping or re-implement the embedding layer.)


Model architecture (controlled study)

  • Type: decoder-only Transformer (GPT-like)
  • Layers: 9
  • Hidden size: d_model = 1024
  • Heads: n_head = 32
  • Context length used in training: 1024
  • Embedding: frozen 16-dim binary (n_embed=16) + deterministic expansion to d_model (repeat_interleave)

Training method (constructive growth)

This model was trained in staged growth (as in the controlled study):

  1. Train layers 1–3, freeze them
  2. Add layers 4–6, train only new layers, freeze them
  3. Add layers 7–9, train only new layers

A key property: at later stages, most parameters are frozen and only the newly added top block(s) are trainable.
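
A minimal sketch of that loop, using a generic PyTorch stack as a stand-in (the blocks, stage boundaries, and optimizer here are illustrative; the actual training code is in the PGT repo linked above):

import torch
import torch.nn as nn

# Stand-in for the 9-layer stack (the real model is decoder-only).
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=1024, nhead=32, batch_first=True)
    for _ in range(9)
)

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad_(flag)

stages = [(0, 3), (3, 6), (6, 9)]  # layers 1-3, then 4-6, then 7-9
for start, end in stages:
    for i, blk in enumerate(blocks):
        # Only the newly added blocks are trainable; everything below stays frozen.
        set_trainable(blk, start <= i < end)
    trainable = [p for p in blocks.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=3e-4)
    # ... run the usual LM training loop for this stage, then move to the next ...
    print(f"stage layers {start + 1}-{end}: {sum(p.numel() for p in trainable):,} trainable params")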


Tokenizer

This model uses the BVV tokenizer family; a canonical tokenizer repo exists for the family.

Note: even though the tokenizer is shared across the family, this repository also includes the embedding artifacts specific to this model, so for exact reproducibility it is recommended to load the tokenizer from this model repo.


Intended use

Research / analysis of:

  • constructive (layer-wise) growth training
  • emergent semantics with non-semantic / frozen embeddings
  • comparisons vs richer frozen embeddings (UNICODE) and standard trainable embeddings

Not intended as a general-purpose assistant model. Outputs may be unreliable and the model may reflect biases present in the training corpus.


How to use (Transformers)


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/growing-transformers-model-16-bit-1-9-181m")
model = AutoModelForCausalLM.from_pretrained("Bochkov/growing-transformers-model-16-bit-1-9-181m", trust_remote_code=True).to('cuda')

inputs = torch.tensor([tokenizer.encode("Write a short poem about the ocean. ")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=50,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Write a short poem about the ocean. The song was written by the band and was released on the same day as the album was release

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of India?\nAnswer:")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=10,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Question: What is the capital of India?
#Answer:Mumbai
#    </s><

Verify the 16-bit frozen binary embeddings (sanity check)

The model uses a frozen nn.Embedding with num_embeddings = 65,536 and embedding_dim = 16 (n_embed), whose values are strictly binary (0/1). Each 16-dim vector is then deterministically expanded to d_model = 1024 via repeat_interleave with repeats = 64 (the config's scale).

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Bochkov/growing-transformers-model-16-bit-1-9-181m"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

print("vocab_size:", tokenizer.vocab_size)
print("config:", {k: getattr(model.config, k) for k in ["vocab_size", "n_embed", "d_model", "n_layer", "n_head", "scale"]})

# --- 1) Show embedding matrix shape (should be 65536 x 16) ---
W = model.token_embeddings.weight.detach().cpu()
print("token_embeddings.weight shape:", tuple(W.shape))  # (65536, 16)

# --- 2) Tokenize 'A' and show its token id (should be 65 for a unicode-char tokenizer) ---
text = "A"
ids = tokenizer.encode(text, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(ids)

print(f"text={text!r}")
print("ids:", ids)
print("tokens:", tokens)

tid = ids[0]

# --- 3) Print the 16-dim vector and verify it is binary (0/1) ---
e16 = W[tid]  # shape: (16,)
print("16-dim embedding for token id", tid, ":", e16.tolist())

uniq = torch.unique(e16)
print("unique values in e16:", uniq.tolist())

is_binary = torch.all((e16 == 0) | (e16 == 1)).item()
print("is strictly binary (0/1):", is_binary)

# --- 4) Show deterministic expansion to d_model=1024 via repeat_interleave ---
scale = model.config.scale  # should be 1024 // 16 = 64
e1024 = e16.repeat_interleave(scale)  # shape: (1024,)
print("expanded embedding shape:", tuple(e1024.shape))
print("expanded embedding first 128 values:", e1024[:128].tolist())

# --- 5) Global check: all embedding weights are exactly 0/1 ---
is_binary_global = torch.all((W == 0) | (W == 1)).item()
num_non_binary = torch.numel(W) - torch.sum((W == 0) | (W == 1)).item()
print("is binary globally (0/1):", is_binary_global)
print("non-binary entries:", int(num_non_binary))

Expected output highlights (example):

  • vocab_size: 65536
  • config: {'vocab_size': 65536, 'n_embed': 16, 'd_model': 1024, 'n_layer': 9, 'n_head': 32, 'scale': 64}
  • token_embeddings.weight shape: (65536, 16)
  • text='A'
  • ids: [65]
  • tokens: ['A']
  • 16-dim embedding for token id 65 : [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  • unique values in e16: [0.0, 1.0]
  • is strictly binary (0/1): True
  • expanded embedding shape: (1024,)
  • expanded embedding first 128 values: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  • is binary globally (0/1): True
  • non-binary entries: 0

🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o}
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}