---

license: apache-2.0
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
    - Student
    - Research Graduate
    - AI researcher
    - AI developer/engineer
    - Reporter
    - Other
  geo: ip_location
  By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox
extra_gated_description: >-
  The information you provide will be collected, stored, processed and shared in
  accordance with the [Meta Privacy
  Policy](https://www.facebook.com/privacy/policy/).
extra_gated_button_content: Submit
language:
- en
tags:
- <relevant tags to be included in HF filters>
---


[![Static Badge](https://img.shields.io/badge/Project_Page-215650)](https://physics.allen-zhu.com/part-4-architecture-design/part-4-1)
[![Static Badge](https://img.shields.io/badge/Part_4.1-ssrn.5240330-b31b1b?logo=ssrn)](https://ssrn.com/abstract=5240330)
[![Static Badge](https://img.shields.io/badge/Part_4.1-2512.17351-b31b1b?logo=arxiv)](https://arxiv.org/abs/2512.17351)
[![Static Badge](https://img.shields.io/badge/Part_4.2-PhysicsLM4-181717?logo=github)](https://github.com/facebookresearch/PhysicsLM4)
[![Static Badge](https://img.shields.io/badge/HF-PhysicsLM4.2-FFD21E?logo=huggingface)](../../)

# Physics of Language Models: Part 4.2, Canon Layers at Scale where Synthetic Pretraining Resonates in Reality
## Transformer Model vs. Canon Layers --- LlamaCanon Release

Our released paper, [*Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers*](https://ssrn.com/abstract=5240330), uses a synthetic pretraining playground to demonstrate that the Canon layer is a powerful architectural add-on that improves language model performance on multiple fronts, perhaps for *every* possible architecture (original Transformer or linear models).

In this release, we provide code and pre-trained models to showcase how these findings extend to real-world pretraining. Specifically, we compare the vanilla *Llama architecture* with our modified *LlamaCanon* variant, both pretrained under the same *controlled settings*.

<div align="center">
<img src="plots/model-training-time.png" style="object-fit: contain; display:inline-block;" />
<em><b>Figure 1:</b> Quick illustration of performance vs. model size/training time.</em>
</div>

## ✨Highlights of the Release

1. **Broad Model Availability**: We release **16 base models** (1B, 3B, and 8B) pretrained on the open-sourced [Nemotron-CC](https://research.nvidia.com/labs/adlr/Nemotron-CC/) dataset for 1T or 2T tokens.
2. **Controlled Experiment**: In each setting, we pretrain two versions of LlamaCanon (using two learning rates) and compare them against two corresponding versions of the original Llama pretrained with identical hyperparameters. This ensures a rigorous architectural comparison.
3. **Performance Gain**: LlamaCanon consistently surpasses Llama in all eight controlled comparisons, achieving, for instance, a 2% gain on the MMLU benchmark.
4. **Comparison to Open Models**: Our experiments are benchmarked against open-sourced models trained on similar datasets, ensuring that we study a *realistic pretraining setup* rather than an artificial scenario.

## ⚙️Model Configurations

The figure below summarizes the 16 released models and their parameters:
<div align="center">
<img src="plots/table-params.png" style="object-fit: contain; width: 80%; "/>
<em><b>Figure 2:</b> Names and parameters of the released models.</em>
</div>

## 🔗Links

<div style="display: inline-block; transform: scale(0.9); transform-origin: top left; width: fit-content; white-space: nowrap;">
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-1B-Nemo-1T-lr0.002">
  <img src="https://img.shields.io/badge/Llama-1B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-1T-lr0.002">
  <img src="https://img.shields.io/badge/LlamaCanon-1B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-1B-Nemo-1T-lr0.003">
  <img src="https://img.shields.io/badge/Llama-1B--Nemo--1T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-1T-lr0.003">
  <img src="https://img.shields.io/badge/LlamaCanon-1B--Nemo--1T--lr0.003-white">
</a>
<br/>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-1B-Nemo-2T-lr0.003">
  <img src="https://img.shields.io/badge/Llama-1B--Nemo--2T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-2T-lr0.003">
  <img src="https://img.shields.io/badge/LlamaCanon-1B--Nemo--2T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-1B-Nemo-2T-lr0.005">
  <img src="https://img.shields.io/badge/Llama-1B--Nemo--2T--lr0.005-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-2T-lr0.005">
  <img src="https://img.shields.io/badge/LlamaCanon-1B--Nemo--2T--lr0.005-white">
</a>
<br/>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-3B-Nemo-1T-lr0.002">
  <img src="https://img.shields.io/badge/Llama-3B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-3B-Nemo-1T-lr0.002">
  <img src="https://img.shields.io/badge/LlamaCanon-3B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-3B-Nemo-1T-lr0.003">
  <img src="https://img.shields.io/badge/Llama-3B--Nemo--1T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-3B-Nemo-1T-lr0.003">
  <img src="https://img.shields.io/badge/LlamaCanon-3B--Nemo--1T--lr0.003-white">
</a>
<br/>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-8B-Nemo-1T-lr0.002">
  <img src="https://img.shields.io/badge/Llama-8B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-8B-Nemo-1T-lr0.002">
  <img src="https://img.shields.io/badge/LlamaCanon-8B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-8B-Nemo-1T-lr0.003">
  <img src="https://img.shields.io/badge/Llama-8B--Nemo--1T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-8B-Nemo-1T-lr0.003">
  <img src="https://img.shields.io/badge/LlamaCanon-8B--Nemo--1T--lr0.003-white">
</a>
</div>
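
The repository names linked above follow a regular pattern, so the eight controlled (Llama, LlamaCanon) pairs can be enumerated programmatically. The following is a minimal sketch; the helper `repo_id` and the constants `SETTINGS`, `ARCHS`, `PAIRS`, and `ALL_MODELS` are hypothetical names introduced here for illustration, and the settings are simply read off the links in this section.

```python
# The eight pretraining settings (model size, token budget, learning rate),
# read off the links in this section.
SETTINGS = [
    ("1B", "1T", "0.002"), ("1B", "1T", "0.003"),
    ("1B", "2T", "0.003"), ("1B", "2T", "0.005"),
    ("3B", "1T", "0.002"), ("3B", "1T", "0.003"),
    ("8B", "1T", "0.002"), ("8B", "1T", "0.003"),
]
ARCHS = ("Llama", "LlamaCanon")


def repo_id(arch: str, size: str, tokens: str, lr: str) -> str:
    """Build a Hugging Face repo id following the naming pattern of the links."""
    return f"facebook/PhysicsLM4.2__{arch}-{size}-Nemo-{tokens}-lr{lr}"


# Each setting yields one controlled (Llama, LlamaCanon) pair.
PAIRS = [tuple(repo_id(arch, *setting) for arch in ARCHS) for setting in SETTINGS]
ALL_MODELS = [name for pair in PAIRS for name in pair]

for baseline, canon in PAIRS:
    print(baseline, "vs", canon)
```

Any entry of `ALL_MODELS` can then be used as the `model_name` when loading a model with `transformers`.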

## 📊Performance Metrics

The table below illustrates how LlamaCanon performs in comparison to vanilla Llama models, as well as some open-sourced pretraining benchmarks. 
<div align="center">
<img src="plots/table-performance.png" style="object-fit: contain;"/>
<em><b>Figure 3:</b> Cross-benchmark performance evaluation of the released models.</em>
</div>

### 📈Training Curves

To further showcase the advantage of Canon layers throughout pretraining, we provide detailed training-time performance curves. Interactive versions and additional benchmark metrics are available in our [GitHub repository](https://github.com/facebookresearch/PhysicsLM4/tree/main/lingua_results).
<div align="center">
<img src="plots/curve-mmlu.png" style="object-fit: contain;"/>
<em><b>Figure 4:</b> MMLU accuracy vs. training tokens.</em>
</div>

## 📌Model Details

- **Model Type:** Llama Transformer + LlamaCanon Transformer  
- **Language:** English  
- **License:** Apache 2.0  
- **Type:** Base model without any instruction fine-tuning or post-training.  
- **Context length:** 4096 tokens (+ ~50% for LlamaCanon).  
  - *Note*: The models were pretrained with context length 4096. However, unlike traditional RoPE transformers, LlamaCanon demonstrates strong length generalization, extending to ~50% more tokens (as detailed in [our paper](https://ssrn.com/abstract=5240330)). While long-context fine-tuning could further enhance this capability, we have deliberately avoided it to maintain a clean and controlled comparison of base-model pretraining, highlighting the effectiveness of Canon layers.
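
As a rough illustration of the context-length note above, the sketch below estimates how many prompt tokens remain once a generation budget is reserved. The function names and the 0.5 factor are assumptions made for illustration: the "~50%" figure is approximate, so the resulting numbers are a rule of thumb rather than a guaranteed limit.

```python
TRAINED_CONTEXT = 4096       # pretraining context length stated above
CANON_EXTRA_FRACTION = 0.5   # "~50% more tokens" for LlamaCanon; approximate


def approx_context_limit(is_canon: bool) -> int:
    """Approximate usable context length for Llama (False) or LlamaCanon (True)."""
    extra = CANON_EXTRA_FRACTION if is_canon else 0.0
    return int(TRAINED_CONTEXT * (1 + extra))


def max_prompt_tokens(is_canon: bool, max_new_tokens: int) -> int:
    """Prompt tokens left after reserving room for `max_new_tokens` generated tokens."""
    return max(0, approx_context_limit(is_canon) - max_new_tokens)


print(max_prompt_tokens(is_canon=False, max_new_tokens=50))
print(max_prompt_tokens(is_canon=True, max_new_tokens=50))
```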

## 🧩Installation and Dependencies

We highly recommend running `pip install causal-conv1d` for CUDA efficiency, as our implementation of Canon layers relies on depth-wise `conv1d`.
The code is tested with `transformers==4.47.1` and `4.53.3` but should be compatible with many earlier versions. Pass `trust_remote_code=True` when loading so that the architecture code is downloaded automatically.
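
Before loading a model, it can be useful to confirm that the optional dependency is present. The snippet below is a small sketch using only the standard library; `has_module` is a hypothetical helper, and it assumes the `causal-conv1d` pip package is importable under the module name `causal_conv1d`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


# Assumption: the `causal-conv1d` package installs a module named `causal_conv1d`.
if has_module("causal_conv1d"):
    print("causal_conv1d is installed.")
else:
    print("causal_conv1d not found; consider running `pip install causal-conv1d`.")

# The same check works for the main dependency.
print("transformers installed:", has_module("transformers"))
```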

## ▶️Demo

The following sample demonstrates how to use our pre-trained models:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Choose any of our 16 released models
# model_name = "facebook/PhysicsLM4.2__LlamaCanon-8B-Nemo-1T-lr0.003"
model_name = "facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-2T-lr0.005"
# model_name = "facebook/PhysicsLM4.2__Llama-3B-Nemo-1T-lr0.003"

# Below is simply a wrapper of either the Llama2 tokenizer (for <=3B models)
#   or Llama3 (for 8B models); alternatively, you can download your own
#   Huggingface llama2/3 tokenizers and use that instead
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).cuda()

input_text = "Galileo Galilei climbed the Leaning Tower of Pisa to conduct a controlled experiment."
inputs = tokenizer(input_text, return_tensors="pt")
output_ids = model.generate(inputs['input_ids'].cuda(), max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## ⚠️Bias, Risks, and Limitations

The models are released for research purposes only (mainly for controlled experiments comparing Llama and LlamaCanon) and are not intended for applications requiring high factual accuracy, safety-critical use cases, or medical/health contexts. The models were pretrained on open datasets and are not safety- or alignment-tuned, meaning:

- They may generate content that is factually incorrect, biased, harmful, or offensive.
- Outputs may include objectionable content even if such outcomes weren't explicitly intended.
- Users are responsible for ensuring appropriate evaluation and implementing additional filtering or safety mechanisms suitable for their specific use cases.

---

## 📖Citation

Please cite the following if you use our models or findings in your research:
```bibtex
@inproceedings{Allen2025-canon,
  author = {{Allen-Zhu}, Zeyuan},
  title = {{Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers}},
  year = {2025},
  booktitle = {Proceedings of the 39th Conference on Neural Information Processing Systems},
  series = {NeurIPS~'25},
  note = {Full version available at \url{https://ssrn.com/abstract=5240330}}
}
@misc{Allen2025-resonate,
  title = {{Physics of Language Models: Part 4.2, Canon Layers at Scale where Synthetic Pretraining Resonates in Reality}},
  author = {{Allen-Zhu}, Zeyuan},
  year = {2025},
  url = {https://physics.allen-zhu.com/part-4-architecture-design/part-4-2},
  note = {Code released at \url{https://github.com/facebookresearch/PhysicsLM4}},
}
```

## Additional Resources

- [GitHub Repository](https://github.com/facebookresearch/PhysicsLM4): full training recipes, model configurations, and interactive plots (on all benchmarks).

## Model Card Author

- Zeyuan Allen-Zhu