---

license: apache-2.0
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
    - Student
    - Research Graduate
    - AI researcher
    - AI developer/engineer
    - Reporter
    - Other
  geo: ip_location
  By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox
extra_gated_description: >-
  The information you provide will be collected, stored, processed and shared in
  accordance with the [Meta Privacy
  Policy](https://www.facebook.com/privacy/policy/).
extra_gated_button_content: Submit
language:
- en
tags:
- <relevant tags to be included in HF filters>
---


[![Static Badge](https://img.shields.io/badge/Project_Page-215650)](https://physics.allen-zhu.com/part-4-architecture-design/part-4-1)
[![Static Badge](https://img.shields.io/badge/Part_4.1-ssrn.5240330-b31b1b?logo=ssrn)](https://ssrn.com/abstract=5240330)
[![Static Badge](https://img.shields.io/badge/Part_4.1-2512.17351-b31b1b?logo=arxiv)](https://arxiv.org/abs/2512.17351)
[![Static Badge](https://img.shields.io/badge/Part_4.2-PhysicsLM4-181717?logo=github)](https://github.com/facebookresearch/PhysicsLM4)
[![Static Badge](https://img.shields.io/badge/HF-PhysicsLM4.2-FFD21E?logo=huggingface)](../../)

# Physics of Language Models: Part 4.2, Canon Layers at Scale where Synthetic Pretraining Resonates in Reality
## Transformer Model vs. Canon Layers --- LlamaCanon Release

Our released paper, [*Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers*](https://ssrn.com/abstract=5240330), uses a synthetic pretraining playground to demonstrate that the Canon layer is a powerful architectural add-on that improves language model performance on multiple fronts, perhaps for *every* possible architecture (original Transformer or linear models).

In this release, we provide code and pre-trained models to showcase how these findings extend to real-world pretraining. Specifically, we compare the vanilla *Llama architecture* with our modified *LlamaCanon* variant, both pretrained under the same *controlled settings*.

<div align="center">
<img src="plots/model-training-time.png" style="object-fit: contain; display:inline-block;" />
<em><b>Figure 1:</b> Quick illustration of performance vs. model size/training time.</em>
</div>

## ✨Highlights of the Release

1. **Broad Model Availability**: We release **16 base models** (1B, 3B, and 8B) pretrained on the open-sourced [Nemotron-CC](https://research.nvidia.com/labs/adlr/Nemotron-CC/) dataset for 1T or 2T tokens.
2. **Controlled Experiment**: In each setting, we pretrain two versions of LlamaCanon (using two learning rates) and compare them against two corresponding versions of the original Llama pretrained with identical hyperparameters. This ensures a rigorous architectural comparison.
3. **Performance Gain**: LlamaCanon consistently surpasses Llama in all eight controlled comparisons, achieving, for instance, a 2% gain on the MMLU benchmark.
4. **Comparison to Open Models**: Our experiments are benchmarked against open-sourced models trained on similar datasets, ensuring that we study a *realistic pretraining setup* rather than an artificial scenario.

## ⚙️Model Configurations

The 16 released models and their parameters are summarized below:
<div align="center">
<img src="plots/table-params.png" style="object-fit: contain; width: 80%; "/>
<em><b>Figure 2:</b> Names and parameters of the released models.</em>
</div>

## 🔗Links

<div style="display: inline-block; transform: scale(0.9); transform-origin: top left; width: fit-content; white-space: nowrap;">
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-1B-Nemo-1T-lr0.002">
  <img src="https://img.shields.io/badge/Llama-1B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-1T-lr0.002">
  <img src="https://img.shields.io/badge/LlamaCanon-1B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-1B-Nemo-1T-lr0.003">
  <img src="https://img.shields.io/badge/Llama-1B--Nemo--1T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-1T-lr0.003">
  <img src="https://img.shields.io/badge/LlamaCanon-1B--Nemo--1T--lr0.003-white">
</a>
<br/>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-1B-Nemo-2T-lr0.003">
  <img src="https://img.shields.io/badge/Llama-1B--Nemo--2T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-2T-lr0.003">
  <img src="https://img.shields.io/badge/LlamaCanon-1B--Nemo--2T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-1B-Nemo-2T-lr0.005">
  <img src="https://img.shields.io/badge/Llama-1B--Nemo--2T--lr0.005-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-2T-lr0.005">
  <img src="https://img.shields.io/badge/LlamaCanon-1B--Nemo--2T--lr0.005-white">
</a>
<br/>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-3B-Nemo-1T-lr0.002">
  <img src="https://img.shields.io/badge/Llama-3B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-3B-Nemo-1T-lr0.002">
  <img src="https://img.shields.io/badge/LlamaCanon-3B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-3B-Nemo-1T-lr0.003">
  <img src="https://img.shields.io/badge/Llama-3B--Nemo--1T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-3B-Nemo-1T-lr0.003">
  <img src="https://img.shields.io/badge/LlamaCanon-3B--Nemo--1T--lr0.003-white">
</a>
<br/>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-8B-Nemo-1T-lr0.002">
  <img src="https://img.shields.io/badge/Llama-8B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-8B-Nemo-1T-lr0.002">
  <img src="https://img.shields.io/badge/LlamaCanon-8B--Nemo--1T--lr0.002-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__Llama-8B-Nemo-1T-lr0.003">
  <img src="https://img.shields.io/badge/Llama-8B--Nemo--1T--lr0.003-white">
</a>
<a href="https://huggingface.co/facebook/PhysicsLM4.2__LlamaCanon-8B-Nemo-1T-lr0.003">
  <img src="https://img.shields.io/badge/LlamaCanon-8B--Nemo--1T--lr0.003-white">
</a>
</div>
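
The repository names linked above follow a regular pattern, so they can also be generated programmatically. The sketch below is illustrative only: the `SETTINGS` table is read off the badges above, and the helper names `repo_id` and `controlled_pairs` are hypothetical, not part of the released code. It pairs each Llama model with the LlamaCanon model that shares its size, token budget, and learning rate.

```python
# Settings read off the badge links above: (size, tokens) -> learning rates.
SETTINGS = {
    ("1B", "1T"): ("0.002", "0.003"),
    ("1B", "2T"): ("0.003", "0.005"),
    ("3B", "1T"): ("0.002", "0.003"),
    ("8B", "1T"): ("0.002", "0.003"),
}
PREFIX = "facebook/PhysicsLM4.2__"


def repo_id(arch: str, size: str, tokens: str, lr: str) -> str:
    """Build one repository ID following the naming pattern of the links."""
    return f"{PREFIX}{arch}-{size}-Nemo-{tokens}-lr{lr}"


def controlled_pairs() -> list[tuple[str, str]]:
    """Return (Llama, LlamaCanon) pairs sharing size, tokens, and learning rate."""
    return [
        (repo_id("Llama", size, tokens, lr), repo_id("LlamaCanon", size, tokens, lr))
        for (size, tokens), lrs in SETTINGS.items()
        for lr in lrs
    ]


if __name__ == "__main__":
    for baseline, canon in controlled_pairs():
        print(baseline, "vs", canon)
```

Each returned pair corresponds to one controlled comparison, and the two entries of a pair differ only in architecture.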

## 📊Performance Metrics

The table below compares LlamaCanon with the vanilla Llama models, as well as with several open-sourced pretrained models.
<div align="center">
<img src="plots/table-performance.png" style="object-fit: contain;"/>
<em><b>Figure 3:</b> Cross-benchmark performance evaluation of the released models.</em>
</div>

### 📈Training Curves

To further showcase the advantage of Canon layers throughout pretraining, we provide detailed training-time performance curves. Interactive versions and additional benchmark metrics are available in our [GitHub repository](https://github.com/facebookresearch/PhysicsLM4/tree/main/lingua_results).
<div align="center">
<img src="plots/curve-mmlu.png" style="object-fit: contain;"/>
<em><b>Figure 4:</b> MMLU accuracy vs. training tokens.</em>
</div>

## 📌Model Details

- **Model Type:** Llama Transformer + LlamaCanon Transformer  
- **Language:** English  
- **License:** Apache 2.0  
- **Type:** Base model without any instruction fine-tuning or post-training.  
- **Context length:** 4096 tokens (+ ~50% for LlamaCanon).  
  - *Note*: The models were pretrained with context length 4096. However, unlike traditional RoPE transformers, LlamaCanon demonstrates strong length generalization, extending to ~50% more tokens (as detailed in [our paper](https://ssrn.com/abstract=5240330)). While long-context fine-tuning could further enhance this capability, we have deliberately avoided it to maintain a clean and controlled comparison of base-model pretraining, highlighting the effectiveness of Canon layers.
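
To make the context-length note concrete, the following sketch turns the stated numbers into a rough prompt budget. It is an assumption-laden illustration: the 1.5x factor simply encodes the "~50% more tokens" estimate above and is not a guaranteed limit, and the helper names `approx_context_budget` and `truncate_to_budget` are hypothetical.

```python
PRETRAIN_CONTEXT = 4096  # pretraining context length stated above


def approx_context_budget(model_name: str) -> int:
    """Rough usable context in tokens (assumption: ~1.5x for LlamaCanon)."""
    is_canon = "LlamaCanon" in model_name
    return int(PRETRAIN_CONTEXT * 1.5) if is_canon else PRETRAIN_CONTEXT


def truncate_to_budget(
    token_ids: list[int], model_name: str, max_new_tokens: int = 0
) -> list[int]:
    """Keep the most recent tokens so prompt plus generation fits the budget."""
    room = approx_context_budget(model_name) - max_new_tokens
    if room <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return token_ids[-room:]
```

For example, `approx_context_budget` gives 4096 for a Llama model and 6144 for a LlamaCanon model; quality near the upper end should be checked empirically for the task at hand.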

## 🧩Installation and Dependencies

We highly recommend running `pip install causal-conv1d` for CUDA efficiency, as our implementation of Canon layers relies on depth-wise `conv1d`.
The code is tested with `transformers==4.47.1` and `4.53.3` but should be compatible with many earlier versions. Pass `trust_remote_code=True` so that the architecture code is downloaded automatically.
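
For readers unfamiliar with the operation, the sketch below shows what a depth-wise causal 1D convolution computes, in plain Python. It is not the released Canon layer implementation: the kernel size, the weights, and the function name `depthwise_causal_conv1d` are illustrative assumptions, and the `causal-conv1d` package performs the equivalent computation far more efficiently on GPU.

```python
def depthwise_causal_conv1d(x, weights):
    """Depth-wise causal convolution over a sequence.

    x:       list of T time steps, each a list of C channel values.
    weights: list of C kernels, each a list of K taps; the last tap
             multiplies the current step, earlier taps multiply past steps.
    Each channel is filtered independently (depth-wise), and step t only
    sees steps <= t (causal), with zeros assumed before the sequence start.
    """
    num_steps = len(x)
    num_channels = len(weights)
    kernel_size = len(weights[0])
    out = [[0.0] * num_channels for _ in range(num_steps)]
    for t in range(num_steps):
        for c in range(num_channels):
            acc = 0.0
            for k in range(kernel_size):
                src = t - (kernel_size - 1) + k  # past step read by this tap
                if src >= 0:  # earlier positions act as zero padding
                    acc += weights[c][k] * x[src][c]
            out[t][c] = acc
    return out


if __name__ == "__main__":
    x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    # Channel 0 averages the previous and current step; channel 1 is identity.
    weights = [[0.5, 0.5], [0.0, 1.0]]
    print(depthwise_causal_conv1d(x, weights))
```

Because each channel has its own small kernel and no cross-channel mixing, the operation adds few parameters relative to the attention and MLP blocks.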

## ▶️Demo

The following sample demonstrates how to use our pre-trained models:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Choose any of our 16 released models
# model_name = "facebook/PhysicsLM4.2__LlamaCanon-8B-Nemo-1T-lr0.003"
model_name = "facebook/PhysicsLM4.2__LlamaCanon-1B-Nemo-2T-lr0.005"
# model_name = "facebook/PhysicsLM4.2__Llama-3B-Nemo-1T-lr0.003"

# Below is simply a wrapper of either the Llama2 tokenizer (for <=3B models)
#   or Llama3 (for 8B models); alternatively, you can download your own
#   Huggingface llama2/3 tokenizers and use that instead
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).cuda()

input_text = "Galileo Galilei climbed the Leaning Tower of Pisa to conduct a controlled experiment."
inputs = tokenizer(input_text, return_tensors="pt")
output_ids = model.generate(inputs['input_ids'].cuda(), max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## ⚠️Bias, Risks, and Limitations

The models are released for research purposes only (mainly for controlled experiments comparing Llama and LlamaCanon) and are not intended for applications requiring high factual accuracy, safety-critical use cases, or medical/health contexts. The models were pretrained on open datasets and are not safety- or alignment-tuned, meaning:

- They may generate content that is factually incorrect, biased, harmful, or offensive.
- Outputs may include objectionable content even if such outcomes weren't explicitly intended.
- Users are responsible for ensuring appropriate evaluation and implementing additional filtering or safety mechanisms suitable for their specific use cases.

---

## 📖Citation

Please cite the following if you use our models or findings in your research:
```bibtex
@inproceedings{Allen2025-canon,
  author = {{Allen-Zhu}, Zeyuan},
  title = {{Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers}},
  year = {2025},
  booktitle = {Proceedings of the 39th Conference on Neural Information Processing Systems},
  series = {NeurIPS~'25},
  note = {Full version available at \url{https://ssrn.com/abstract=5240330}}
}
@misc{Allen2025-resonate,
  title = {{Physics of Language Models: Part 4.2, Canon Layers at Scale where Synthetic Pretraining Resonates in Reality}},
  author = {{Allen-Zhu}, Zeyuan},
  year = {2025},
  url = {https://physics.allen-zhu.com/part-4-architecture-design/part-4-2},
  note = {Code released at \url{https://github.com/facebookresearch/PhysicsLM4}},
}
```

## Additional Resources

- [GitHub Repository](https://github.com/facebookresearch/PhysicsLM4) includes
  - Full training recipes, model configurations, and interactive plots (on all benchmarks).  

## Model Card Author

- Zeyuan Allen-Zhu