---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- transformer
- causal-lm
- progressive-growth
- constructive-learning
- frozen-embeddings
- bvv
---

# Model Card for abs-bvv-2

## Model Description

`abs-bvv-2` is a 1.5-billion-parameter decoder-only Transformer model. It is the second model in the **Progressive Growth Transformers (PGT)** series, designed to explore how linguistic and reasoning capabilities emerge as a function of model depth.
This model is presented in the paper [Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate](https://huggingface.co/papers/2507.07129).

This model was not trained monolithically. Instead, it was "grown" constructively, one layer at a time, upon a foundation of **frozen, non-semantic visual embeddings**, as introduced in the paper "[Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations](https://arxiv.org/abs/2507.04886)".

The core idea is to demonstrate an alternative, more modular and resource-efficient paradigm for building LLMs. The PGT series shows that:
1.  Semantic understanding can emerge without trainable embeddings.
2.  Complex reasoning abilities are a direct result of compositional depth.
3.  Models can be built incrementally, much like a living organism grows, rather than being forged all at once.

`abs-bvv-2` represents the state of the model after 2 layers of progressive training. It has 2 Transformer blocks, a hidden dimension of 4096, and uses the `bvv241` tokenizer family.
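As a rough sanity check on the 1.5B figure, the block and embedding sizes above can be combined in a back-of-envelope count. This sketch assumes details the card does not state: no bias terms, a standard 4x MLP expansion, and a bvv241 vocabulary of roughly 2^18 tokens (an assumption, not a documented value).

```python
# Back-of-envelope parameter count for abs-bvv-2.
# Assumptions (not stated in the card): no biases, standard 4x MLP
# expansion, and an assumed vocabulary of ~2^18 tokens for bvv241.

D_MODEL, N_LAYER = 4096, 2
VOCAB = 2 ** 18  # assumed vocabulary size

attn = 4 * D_MODEL * D_MODEL           # Q, K, V and output projections
mlp = 2 * (4 * D_MODEL) * D_MODEL      # up- and down-projections at 4x width
per_layer = attn + mlp                 # parameters per Transformer block

embeddings = VOCAB * D_MODEL           # frozen visual embedding table
total = N_LAYER * per_layer + embeddings

print(f"per layer: {per_layer/1e6:.0f}M, total: {total/1e9:.2f}B")
# → per layer: 201M, total: 1.48B
```

Under these assumptions, most of the 1.5B parameters sit in the frozen embedding table, with only ~0.4B in the two trained blocks.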

**Code:** [https://github.com/AVBochkov/PGT](https://github.com/AVBochkov/PGT)

## Intended Use

This model is primarily an artifact for research into emergent capabilities, constructive learning, and the role of embeddings in LLMs. It can be used for text generation, but it is not fine-tuned for specific downstream tasks and may produce unpredictable outputs. It is suitable for exploring the raw capabilities of a model trained under this novel paradigm.

## Training Details

- **Architecture:** 2-layer decoder-only Transformer (`n_layer=2`, `d_model=4096`, `n_head=32`).
- **Embeddings:** The token embedding layer is frozen and derived from visual representations of Unicode glyphs; it is never updated during training.
- **Training method:** Progressive layer-wise growth. The model was built one layer at a time: layer 1 was trained to convergence and frozen, then layer 2 was added and trained, and so on. For the deeper models in the series (layers 5 and 6), LoRA was used to fine-tune all existing layers alongside the new layer to ensure global coherence.
- **Parameters:** 1.5B total.
- **Data:** A ~9B-token mix of Wikipedia and SFT datasets (the latter ~10% of the mix).
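The layer-wise growth schedule can be sketched as a small helper. This is a hypothetical illustration, not code from the released repository: it only models which layers are trainable, frozen, or LoRA-adapted at each growth stage.

```python
# Hypothetical sketch of the progressive layer-wise growth schedule.
# Layers are numbered from 1; `lora_from` marks the stage at which
# frozen layers start receiving LoRA adapters (5 per the model card).

def growth_stage(new_layer: int, lora_from: int = 5):
    """Return (trainable, frozen, lora_adapted) layer ids for the
    stage at which layer `new_layer` is added and trained."""
    earlier = list(range(1, new_layer))
    # From `lora_from` on, frozen layers are also LoRA-adapted so the
    # whole stack stays globally coherent as the new layer trains.
    lora = earlier if new_layer >= lora_from else []
    return [new_layer], earlier, lora

# Stage 2 (this checkpoint): layer 1 frozen, layer 2 freshly trained.
print(growth_stage(2))  # ([2], [1], [])
# Stage 5: layers 1-4 frozen but LoRA-adapted alongside new layer 5.
print(growth_stage(5))  # ([5], [1, 2, 3, 4], [1, 2, 3, 4])
```

In a real training loop, "frozen" would mean setting `requires_grad=False` on those layers' parameters before each stage.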

## Limitations and Bias

This model is a research prototype and has several limitations:

- **Not instruction-tuned:** It is a base model and will not reliably follow instructions or engage in dialogue.
- **Potential for hallucinations:** Like all LLMs, it can generate factually incorrect or nonsensical text.
- **Data bias:** Trained primarily on Wikipedia, it will reflect the biases present in that corpus.
- **Limited scope:** The model was trained on a relatively small dataset (~9B tokens) compared to state-of-the-art models. Its performance is intended to be evaluated relative to its own baseline (trainable embeddings) and shallower versions, not against large commercial models.

## 🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

```
@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o},
      note={}
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}
```

This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs — a step toward modular, fusable, multilingual LMs.

## How to Use

The model can be loaded using the `transformers` library. Note that `trust_remote_code=True` is required as it uses a custom model architecture.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained('Bochkov/abs-bvv-2')
model = AutoModelForCausalLM.from_pretrained(
    'Bochkov/abs-bvv-2',
    trust_remote_code=True,        # required: custom model architecture
    torch_dtype=torch.bfloat16,
).to('cuda')

inputs = tokenizer("Hello, I am a language model ", return_tensors="pt").to('cuda')

# Sample up to 100 new tokens with temperature, top-k, and nucleus sampling
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```