---
license: apache-2.0
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
      - Student
      - Research Graduate
      - AI researcher
      - AI developer/engineer
      - Reporter
      - Other
  geo: ip_location
  By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox
extra_gated_description: >-
  The information you provide will be collected, stored, processed and shared in
  accordance with the [Meta Privacy
  Policy](https://www.facebook.com/privacy/policy/).
extra_gated_button_content: Submit
language:
- en
tags:
- <relevant tags to be included in HF filters>
---
[Project page](https://physics.allen-zhu.com/part-4-architecture-design/part-4-1)
[SSRN](https://ssrn.com/abstract=5240330)
[arXiv](https://arxiv.org/abs/2512.17351)
[GitHub](https://github.com/facebookresearch/PhysicsLM4)
# Physics of Language Models: Part 4.2, Canon Layers at Scale where Synthetic Pretraining Resonates in Reality
## Transformer Model vs. Canon Layers --- LlamaCanon Release
Our paper, [*Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers*](https://ssrn.com/abstract=5240330), uses a synthetic pretraining playground to show that the Canon layer is a powerful architectural add-on that improves language model performance on multiple fronts, perhaps for *every* possible architecture (original Transformer or linear models).
In this release, we provide code and pre-trained models to showcase how these findings extend to real-world pretraining. Specifically, we compare the vanilla *Llama architecture* with our modified *LlamaCanon* variant, both pretrained under the same *controlled settings*.
<div align="center">
<img src="plots/model-training-time.png" style="object-fit: contain; display:inline-block;" />
<em><b>Figure 1:</b> Quick illustration of performance vs. model size/training time.</em>
</div>
## ✨Highlights of the Release
1. **Broad Model Availability**: We release **16 base models** (1B, 3B, and 8B) pretrained on the open-sourced [Nemotron-CC](https://research.nvidia.com/labs/adlr/Nemotron-CC/) dataset for 1T or 2T tokens.
2. **Controlled Experiment**: In each setting, we pretrain two versions of LlamaCanon (using two learning rates) and compare them against two corresponding versions of the original Llama pretrained with identical hyperparameters. This ensures a rigorous architectural comparison.
3. **Performance Gain**: LlamaCanon consistently surpasses Llama in all eight controlled comparisons, achieving, for instance, a 2% gain on the MMLU benchmark.
4. **Comparison to Open Models**: Our experiments are benchmarked against open-sourced models trained on similar datasets, ensuring that we study a *realistic pretraining setup* rather than an artificial scenario.
## ⚙️Model Configurations
The 16 released models and their parameters are summarized below:
<div align="center">
<img src="plots/table-params.png" style="object-fit: contain; width: 80%; "/>
<em><b>Figure 2:</b> Names and parameters of the released models.</em>
</div>
## 🔗Links
<div style="
display: inline-block;
transform: scale(0.9);
transform-origin: top left;
width: fit-content;
white-space: nowrap;
">
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-1B-Nemo-1T-lr0.002">
<img src="https://img.shields.io/badge/Llama-1B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-1T-lr0.002">
<img src="https://img.shields.io/badge/LlamaCanon-1B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-1B-Nemo-1T-lr0.003">
<img src="https://img.shields.io/badge/Llama-1B--Nemo--1T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-1T-lr0.003">
<img src="https://img.shields.io/badge/LlamaCanon-1B--Nemo--1T--lr0.003-white">
</a>
<br/>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-1B-Nemo-2T-lr0.003">
<img src="https://img.shields.io/badge/Llama-1B--Nemo--2T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-2T-lr0.003">
<img src="https://img.shields.io/badge/LlamaCanon-1B--Nemo--2T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-1B-Nemo-2T-lr0.005">
<img src="https://img.shields.io/badge/Llama-1B--Nemo--2T--lr0.005-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-2T-lr0.005">
<img src="https://img.shields.io/badge/LlamaCanon-1B--Nemo--2T--lr0.005-white">
</a>
<br/>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-3B-Nemo-1T-lr0.002">
<img src="https://img.shields.io/badge/Llama-3B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-3B-Nemo-1T-lr0.002">
<img src="https://img.shields.io/badge/LlamaCanon-3B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-3B-Nemo-1T-lr0.003">
<img src="https://img.shields.io/badge/Llama-3B--Nemo--1T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-3B-Nemo-1T-lr0.003">
<img src="https://img.shields.io/badge/LlamaCanon-3B--Nemo--1T--lr0.003-white">
</a>
<br/>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-8B-Nemo-1T-lr0.002">
<img src="https://img.shields.io/badge/Llama-8B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-8B-Nemo-1T-lr0.002">
<img src="https://img.shields.io/badge/LlamaCanon-8B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-8B-Nemo-1T-lr0.003">
<img src="https://img.shields.io/badge/Llama-8B--Nemo--1T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-8B-Nemo-1T-lr0.003">
<img src="https://img.shields.io/badge/LlamaCanon-8B--Nemo--1T--lr0.003-white">
</a>
</div>
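
The sixteen repository names linked above follow a regular pattern, so they can also be generated programmatically. The following is a minimal sketch; the helper names `released_model_ids` and `controlled_pairs` are hypothetical and not part of the release, and the settings table is simply transcribed from the links in this section.

```python
# Hypothetical helpers (not part of the release) that rebuild the
# repository IDs listed in the Links section from their naming pattern.
ORG_PREFIX = "facebook/PhysicsLM4.2__"

# (model size, training tokens) -> the two learning rates used in that setting,
# transcribed from the links in this model card.
SETTINGS = {
    ("1B", "1T"): ("0.002", "0.003"),
    ("1B", "2T"): ("0.003", "0.005"),
    ("3B", "1T"): ("0.002", "0.003"),
    ("8B", "1T"): ("0.002", "0.003"),
}


def model_id(arch, size, tokens, lr):
    """Format one repository ID, e.g. arch='LlamaCanon', size='1B'."""
    return f"{ORG_PREFIX}{arch}-{size}-Nemo-{tokens}-lr{lr}"


def controlled_pairs():
    """Return (Llama, LlamaCanon) ID pairs that share size, tokens, and lr."""
    pairs = []
    for (size, tokens), learning_rates in SETTINGS.items():
        for lr in learning_rates:
            pairs.append(
                (
                    model_id("Llama", size, tokens, lr),
                    model_id("LlamaCanon", size, tokens, lr),
                )
            )
    return pairs


def released_model_ids():
    """Flatten the pairs into a single list of repository IDs."""
    return [name for pair in controlled_pairs() for name in pair]
```

Each pair corresponds to one controlled comparison: the two models differ only in architecture, which is what makes a side-by-side evaluation meaningful.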
## 📊Performance Metrics
The table below compares LlamaCanon with the vanilla Llama models, as well as with several open-source pretrained models used as reference points.
<div align="center">
<img src="plots/table-performance.png" style="object-fit: contain;"/>
<em><b>Figure 3:</b> Cross-benchmark performance evaluation of the released models.</em>
</div>
### 📈Training Curves
To further showcase the advantage of Canon layers throughout pretraining, we provide detailed training-time performance curves. Interactive versions and additional benchmark metrics are available in our [GitHub repository](https://github.com/facebookresearch/PhysicsLM4/tree/main/lingua_results).
<div align="center">
<img src="plots/curve-mmlu.png" style="object-fit: contain;"/>
<em><b>Figure 4:</b> MMLU accuracy vs. training tokens.</em>
</div>
## 📌Model Details
- **Model Type:** Llama Transformer + LlamaCanon Transformer
- **Language:** English
- **License:** Apache 2.0
- **Type:** Base model without any instruction fine-tuning or post-training.
- **Context length:** 4096 tokens (+ ~50% for LlamaCanon).
  - *Note*: The models were pretrained with context length 4096. However, unlike traditional RoPE transformers, LlamaCanon demonstrates strong length generalization, extending to ~50% more tokens (as detailed in [our paper](https://ssrn.com/abstract=5240330)). While long-context fine-tuning could further enhance this capability, we have deliberately avoided it to maintain a clean and controlled comparison of base-model pretraining, highlighting the effectiveness of Canon layers.
## 🧩Installation and Dependencies
It is highly recommended to `pip install causal-conv1d` for CUDA efficiency, as our implementation of Canon layers relies on depth-wise `conv1d`.
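
For intuition about the depth-wise `conv1d` mentioned above, here is a minimal pure-Python sketch of a causal depth-wise 1D convolution. It is not the released implementation: the function name, the nested-list layout, the optional residual connection, and the example kernel weights are all assumptions made for illustration. The actual Canon layer code is downloaded together with the model.

```python
def depthwise_causal_conv1d(x, weights, residual=True):
    """Illustrative causal depth-wise convolution (hypothetical helper).

    x:       list of seq_len rows, each a list of `channels` floats.
    weights: list of `channels` kernels; weights[c][k] multiplies the
             input of channel c at position t - k (k = 0 is the current token).
    residual: if True, add the input back to the output (an assumption here).
    """
    out = []
    for t in range(len(x)):
        row = []
        for c, kernel in enumerate(weights):
            acc = 0.0
            for k, w in enumerate(kernel):
                if t - k >= 0:  # causal: never look at future positions
                    acc += w * x[t - k][c]
            if residual:
                acc += x[t][c]
            row.append(acc)
        out.append(row)
    return out


# Three tokens, two channels; each channel has its own kernel of size 2.
tokens = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
kernels = [[0.5, 0.5], [1.0, 0.0]]
mixed = depthwise_causal_conv1d(tokens, kernels, residual=False)
```

Depth-wise means each channel is filtered independently with its own small kernel, and causal means position `t` only sees positions `<= t`. This is the access pattern that the `causal-conv1d` CUDA kernels accelerate.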
The code is tested with `transformers==4.47.1` and `4.53.3` but should be compatible with many earlier versions. Pass `trust_remote_code=True` when loading so that the architecture code is downloaded automatically.
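
Before loading a model, it can help to confirm what is installed. The sketch below uses only the standard library; the helper name `check_environment` is hypothetical, and it assumes the `causal-conv1d` package is importable under the module name `causal_conv1d`.

```python
import importlib.util
from importlib import metadata


def check_environment():
    """Report the installed transformers version and whether the optional
    causal-conv1d package appears to be importable (hypothetical helper)."""
    report = {}
    try:
        report["transformers"] = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        report["transformers"] = None  # transformers is not installed
    # Assumption: the pip package `causal-conv1d` installs module `causal_conv1d`.
    report["causal_conv1d"] = importlib.util.find_spec("causal_conv1d") is not None
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```

A `None` for `transformers` means the library is missing; `False` for `causal_conv1d` means the optional package was not found.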
## ▶️Demo
The following sample demonstrates how to use our pre-trained models:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Choose any of our 16 released models
# model_name = "facebook/PhysicsLM4.2__LlamaCanon-8B-Nemo-1T-lr0.003"
model_name = "facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-2T-lr0.005"
# model_name = "facebook/PhysicsLM4.2__Llama-3B-Nemo-1T-lr0.003"
# The tokenizer below is simply a wrapper around either the Llama2 tokenizer
# (for <=3B models) or the Llama3 tokenizer (for 8B models); alternatively, you
# can download your own Hugging Face llama2/3 tokenizers and use those instead
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).cuda()
input_text = "Galileo Galilei climbed the Leaning Tower of Pisa to conduct a controlled experiment."
inputs = tokenizer(input_text, return_tensors="pt")
output_ids = model.generate(inputs['input_ids'].cuda(), max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
## ⚠️Bias, Risks, and Limitations
The models are released for research purposes only (mainly for controlled experiments comparing Llama and LlamaCanon) and are not intended for applications requiring high factual accuracy, safety-critical use cases, or medical/health contexts. The models were pretrained on open datasets and are not safety- or alignment-tuned, meaning:
- They may generate content that is factually incorrect, biased, harmful, or offensive.
- Outputs may include objectionable content even if such outcomes weren't explicitly intended.
- Users are responsible for ensuring appropriate evaluation and implementing additional filtering or safety mechanisms suitable for their specific use cases.
---
## 📖Citation
Please cite the following if you use our models or findings in your research:
```bibtex
@inproceedings{Allen2025-canon,
author = {{Allen-Zhu}, Zeyuan},
title = {{Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers}},
year = {2025},
booktitle = {Proceedings of the 39th Conference on Neural Information Processing Systems},
series = {NeurIPS~'25},
note = {Full version available at \url{https://ssrn.com/abstract=5240330}}
}
@misc{Allen2025-resonate,
title = {{Physics of Language Models: Part 4.2, Canon Layers at Scale where Synthetic Pretraining Resonates in Reality}},
author = {{Allen-Zhu}, Zeyuan},
year = {2025},
url = {https://physics.allen-zhu.com/part-4-architecture-design/part-4-2},
note = {Code released at \url{https://github.com/facebookresearch/PhysicsLM4}},
}
```
## Additional Resources
- [GitHub Repository](https://github.com/facebookresearch/PhysicsLM4): full training recipes, model configurations, and interactive plots (on all benchmarks).
## Model Card Author
- Zeyuan Allen-Zhu