# Emergent Semantics: Model_UNFROZEN (335M) (Baseline)
This repository provides Model_UNFROZEN (335M), a decoder-only Transformer language model trained in the standard setup with trainable input token embeddings.

It is released as the baseline for the paper *Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations* (TMLR 2025).

Primary goal: enable a controlled comparison against the frozen-embedding variant [Bochkov/emergent-semantics-model-uni-glyph-335m](https://huggingface.co/Bochkov/emergent-semantics-model-uni-glyph-335m) under an identical architecture, tokenizer, and training regime.
## What this model is (and is not)
Model_UNFROZEN is a conventional Transformer LM where:
- the token embedding matrix is randomly initialized and trained end-to-end
- the rest of the Transformer is trained normally
This model exists to isolate the effect of freezing / changing the embedding layer.
It is not intended to be a best-performing standalone model.
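The variable being isolated can be made concrete. Below is a minimal PyTorch sketch (not the paper's training code) of the single difference between this baseline and the frozen-embedding counterpart, using the vocabulary size and hidden size from the model summary:

```python
import torch.nn as nn

# Input embedding matrix: vocab_size x d_model (values from the model summary below)
emb = nn.Embedding(65536, 1024)

# Baseline (this model): embeddings receive gradient updates end-to-end.
emb.weight.requires_grad = True   # the default for nn.Embedding

# Frozen-embedding counterpart: embeddings are fixed after initialization.
emb.weight.requires_grad = False
```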
## Model summary
- Architecture: decoder-only Transformer (GPT-like)
- Hidden size (`d_model`): 1024
- Layers: 16
- Heads: 32
- Positional encoding: rotary embeddings
- Activation: GELU
- Input embeddings: trainable (standard `nn.Embedding`)
- Output head: not tied to the input embeddings (trained separately)
- Vocabulary size: 65,536
- Tokenizer: `Bochkov/bvv241-2-3`
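These figures are consistent with the 335M label. A rough back-of-envelope count, assuming a standard 4x MLP expansion (an assumption; the expansion factor is not stated in this card):

```python
# Rough parameter count from the summary above; rotary embeddings add no
# positional parameters, and LayerNorms are negligible at this scale.
d_model, n_layers, vocab = 1024, 16, 65536

embeddings = vocab * d_model               # input embedding matrix
lm_head    = vocab * d_model               # untied output head
attn       = 4 * d_model * d_model         # Wq, Wk, Wv, Wo per layer
mlp        = 2 * d_model * (4 * d_model)   # up- and down-projection per layer

total = embeddings + lm_head + n_layers * (attn + mlp)
print(f"{total / 1e6:.1f}M parameters")    # 335.5M, matching the 335M label
```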
## Intended use
This model is intended for:
- baseline comparisons in research on emergent semantics
- measuring the effect of frozen vs trainable embeddings
- ablations and reproducibility checks for the associated paper
Not intended for production deployment. It is a research artifact trained under constrained compute/data to enable controlled comparisons.
## How to use (Transformers)
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-unfrozen-335m")
model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-unfrozen-335m",
    trust_remote_code=True,
).to('cuda')

# Encode a prompt as a batch of one sequence
inputs = torch.tensor(
    [tokenizer.encode("Question: What is the capital of Japan?\nAnswer:")],
    dtype=torch.long,
    device='cuda',
)

# Greedy decoding (do_sample=False) keeps outputs deterministic
outputs = model.generate(inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(outputs[0].tolist()))
# Question: What is the capital of Japan?
# Answer:Tokyo Metropolitan
```
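Greedy decoding is used above because it makes baseline comparisons reproducible. For qualitative exploration you can switch on sampling; the settings below are illustrative assumptions, not values from the paper:

```python
# Illustrative sampling settings (assumptions, not from the paper)
outputs = model.generate(
    inputs,
    max_new_tokens=32,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0].tolist()))
```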
## Training overview (high level)
- Training data: multilingual Wikipedia subsets + a small portion of SFT-style QA data (see paper)
- Scale: ~4B tokens (resource-constrained setting for controlled comparisons)
- Hardware: H100 80GB (reported setup)
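For scale intuition, the common 6·N·D approximation gives a rough training-compute figure; this is an illustration based on the numbers above, not a figure reported by the authors:

```python
# Back-of-envelope training compute via the common 6*N*D approximation
N = 335e6  # parameters
D = 4e9    # training tokens (from the overview above)
print(f"~{6 * N * D:.1e} training FLOPs")  # ~8.0e+18
```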
## Related repositories
- Paper model collection: https://huggingface.co/collections/Bochkov/emergent-semantics-beyond-token-embeddings
- Frozen-embedding counterpart (main experimental model): https://huggingface.co/Bochkov/emergent-semantics-model-uni-glyph-335m
- Tokenizer: https://huggingface.co/Bochkov/bvv241-2-3
- Code (GitHub): https://github.com/AVBochkov/Embeddings
## 🧑‍🔬 Citation & Concept
If you use this model or the underlying concepts in your research, please cite our work:
```bibtex
@article{bochkov2025emergent,
  title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
  author={Andrey Bochkov},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=Odh8IynO1o},
}

@misc{bochkov2025growingtransformersmodularcomposition,
  title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
  author={A. Bochkov},
  year={2025},
  eprint={2507.07129},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129},
}
```