Rethinking Training Dynamics in Scale-wise Autoregressive Generation

---
language:
- en
license: mit
tags:
- image-generation
- autoregressive
- next-scale-prediction
- exposure-bias
- post-training
- pytorch
- imagenet
library_name: pytorch
inference: false
model-index:
- name: ZGZzz/SAR
  results:
  - task:
      type: image-generation
      name: Image Generation
    dataset:
      name: ImageNet 256×256
      type: imagenet-1k
      config: 256x256
      split: validation
    metrics:
    - type: fid
      name: FID (FlexVAR-d16, +SAR)
      value: 2.89
      higher_is_better: false
    - type: fid
      name: FID (FlexVAR-d20, +SAR)
      value: 2.35
      higher_is_better: false
    - type: fid
      name: FID (FlexVAR-d24, +SAR)
      value: 2.14
      higher_is_better: false
datasets:
- ILSVRC/imagenet-1k
base_model:
- jiaosiyu1999/FlexVAR
pipeline_tag: text-to-image
---

<div align="center">
<h1>Rethinking Training Dynamics in Scale-wise Autoregressive Generation</h1>

<a href="https://gengzezhou.github.io/" target="_blank">Gengze Zhou</a><sup>1*</sup>,
<a href="https://chongjiange.github.io/" target="_blank">Chongjian Ge</a><sup>2</sup>,
<a href="https://www.cs.unc.edu/~airsplay/" target="_blank">Hao Tan</a><sup>2</sup>,
<a href="https://pages.cs.wisc.edu/~fliu/" target="_blank">Feng Liu</a><sup>2</sup>,
<a href="https://yiconghong.me" target="_blank">Yicong Hong</a><sup>2</sup>

<sup>1</sup>Australian Institute for Machine Learning, Adelaide University &nbsp;&nbsp;&nbsp;
<sup>2</sup>Adobe Research

[![arXiv](https://img.shields.io/badge/arXiv-2512.06421-b31b1b.svg)](https://arxiv.org/abs/2512.06421)&nbsp;
[![huggingface weights](https://img.shields.io/badge/%F0%9F%A4%97%20Weights-SAR--ckpts-yellow)](https://huggingface.co/ZGZzz/SAR)&nbsp;
[![project page](https://img.shields.io/badge/Project%20Page-SAR-blue)](https://gengzezhou.github.io/SAR)&nbsp;
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

</div>

## Model Description

**Self-Autoregressive Refinement (SAR)** is a lightweight *post-training* algorithm for **scale-wise autoregressive (AR)** image generation (next-scale prediction). SAR mitigates **exposure bias** by addressing (1) train–test mismatch (teacher forcing vs. student forcing) and (2) imbalance in scale-wise learning difficulty.

SAR consists of:
- **Stagger-Scale Rollout (SSR):** a two-step rollout (teacher-forcing → student-forcing) with minimal compute overhead (one extra forward pass).
- **Contrastive Student-Forcing Loss (CSFL):** stabilizes student-forced training by aligning predictions with a teacher trajectory under self-generated contexts.
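
The data flow of the two-step rollout can be pictured with a toy, framework-free sketch. Everything in it is hypothetical: `toy_predictor`, `two_step_rollout`, and `toy_losses` are illustrative names, the scales are plain Python lists, and the squared-error alignment term only stands in for CSFL, whose exact form is not given in this card. It shows the order of operations (ground-truth context first, self-generated context second), not the released training code.

```python
def toy_predictor(prev_scale, weight=0.9):
    """Hypothetical stand-in for the AR model: predicts the next scale
    (twice as long) by repeating each value of the previous scale."""
    return [weight * v for v in prev_scale for _ in range(2)]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def two_step_rollout(gt_scales, predict):
    # Pass 1 (teacher forcing): predict scale k+1 from ground-truth scale k.
    teacher_preds = [predict(s) for s in gt_scales[:-1]]
    # Pass 2 (student forcing): predict scale k+2 from the model's own
    # pass-1 output for scale k+1. This is the one extra forward pass.
    student_preds = [predict(p) for p in teacher_preds[:-1]]
    return teacher_preds, student_preds


def toy_losses(gt_scales, teacher_preds, student_preds):
    # Teacher-forced term: pass-1 predictions vs. ground truth at scales 1..K-1.
    tf_loss = sum(mse(p, gt_scales[k + 1]) for k, p in enumerate(teacher_preds))
    # Placeholder alignment term (an assumption, NOT the paper's CSFL):
    # pass-2 predictions vs. the pass-1 predictions for the same scale.
    sf_loss = sum(mse(s, teacher_preds[k + 1]) for k, s in enumerate(student_preds))
    return tf_loss, sf_loss


gt_scales = [[1.0] * n for n in (1, 2, 4, 8)]  # four toy scales
teacher_preds, student_preds = two_step_rollout(gt_scales, toy_predictor)
tf_loss, sf_loss = toy_losses(gt_scales, teacher_preds, student_preds)
print([len(p) for p in teacher_preds], [len(p) for p in student_preds])
print(tf_loss, sf_loss)
```

In this toy setup the student-forced pass covers one scale fewer than the teacher-forced pass, because the first predicted scale has no self-generated context to condition on.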

## Key Features

- **Minimal overhead:** SSR adds only a lightweight additional forward pass to train on self-generated content.
- **General post-training recipe:** applies on top of pretrained scale-wise AR models.
- **Empirical gains:** a reported **5.2% relative FID reduction** on FlexVAR-d16 (3.05 → 2.89) after 10 SAR epochs.

## Model Zoo (ImageNet 256×256)

| Model | Params | Base FID ↓ | SAR FID ↓ | SAR Weights |
|---|---:|---:|---:|---|
| SAR-d16 | 310M | 3.05 | **2.89** | `pretrained/SARd16-epo179.pth` |
| SAR-d20 | 600M | 2.41 | **2.35** | `pretrained/SARd20-epo249.pth` |
| SAR-d24 | 1.0B | 2.21 | **2.14** | `pretrained/SARd24-epo349.pth` |
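
The 5.2% figure quoted under Key Features is a relative reduction, and it can be reproduced from the table with a few lines of arithmetic. The snippet below is only a convenience check; `relative_reduction` is a hypothetical helper name, and the FID values are copied from the rows above.

```python
# (base FID, SAR FID) pairs copied from the Model Zoo table; lower is better.
fid = {
    "SAR-d16": (3.05, 2.89),
    "SAR-d20": (2.41, 2.35),
    "SAR-d24": (2.21, 2.14),
}


def relative_reduction(base, sar):
    """Relative FID reduction in percent: 100 * (base - sar) / base."""
    return 100.0 * (base - sar) / base


for name, (base, sar) in fid.items():
    print(f"{name}: {relative_reduction(base, sar):.1f}% lower FID")
# SAR-d16: 5.2% lower FID
# SAR-d20: 2.5% lower FID
# SAR-d24: 3.2% lower FID
```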

## How to Use

### Installation

```bash
git clone https://github.com/GengzeZhou/SAR.git
cd SAR
conda create -n sar python=3.10 -y
conda activate sar
pip install -r requirements.txt
# optional
pip install flash-attn xformers
```

### Sampling / Inference (Example)

```python
import torch
from models import build_vae_var
from torchvision.utils import save_image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Build VAE + VAR backbone (example: depth=16)
vae, model = build_vae_var(
    V=8912, Cvae=32, device=device,
    num_classes=1000, depth=16,
    vae_ckpt="pretrained/FlexVAE.pth",
)

# Load SAR checkpoint
ckpt = torch.load("pretrained/SARd16-epo179.pth", map_location="cpu")
if "trainer" in ckpt:
    ckpt = ckpt["trainer"]["var_wo_ddp"]
model.load_state_dict(ckpt, strict=False)
model.eval()

with torch.no_grad():
    labels = torch.tensor([207, 88, 360, 387], device=device)  # example ImageNet classes
    images = model.autoregressive_infer_cfg(
        vqvae=vae,
        B=labels.shape[0],  # keep the batch size in sync with the label tensor
        label_B=labels,
        cfg=2.5,
        top_k=900,
        top_p=0.95,
    )

save_image(images, "samples.png", normalize=True, value_range=(-1, 1), nrow=4)
```
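
Because the example loads weights with `strict=False`, mismatched parameter names are skipped silently. A small check along the following lines may help catch a wrong checkpoint or a wrong `depth`. This is a sketch under assumptions: `unwrap_state_dict` and `key_mismatch` are hypothetical helpers, not part of the repository, and the plain dicts stand in for real state dicts.

```python
def unwrap_state_dict(ckpt):
    """Return model weights from a raw state dict or a trainer checkpoint.

    Mirrors the unwrapping in the sampling example: trainer checkpoints
    nest the weights under ckpt["trainer"]["var_wo_ddp"].
    """
    if "trainer" in ckpt:
        return ckpt["trainer"]["var_wo_ddp"]
    return ckpt


def key_mismatch(model_keys, ckpt_keys):
    """Return (missing, unexpected) parameter names as sorted lists."""
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)
    missing = sorted(model_keys - ckpt_keys)      # in the model, not in the checkpoint
    unexpected = sorted(ckpt_keys - model_keys)   # in the checkpoint, not in the model
    return missing, unexpected


# Toy stand-ins for model.state_dict() and a loaded trainer checkpoint.
model_state = {"blocks.0.weight": 0, "blocks.0.bias": 0, "head.weight": 0}
trainer_ckpt = {"trainer": {"var_wo_ddp": {"blocks.0.weight": 1, "head.weight": 1, "extra.buf": 1}}}

missing, unexpected = key_mismatch(model_state.keys(), unwrap_state_dict(trainer_ckpt).keys())
print("missing:", missing)
print("unexpected:", unexpected)
```

With a real model the same check would be `key_mismatch(model.state_dict().keys(), unwrap_state_dict(ckpt).keys())`. PyTorch's `load_state_dict` also returns an object with `missing_keys` and `unexpected_keys`, so inspecting its return value is an equivalent option.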

## Training (SAR Post-Training)

```bash
bash scripts/train_SAR_d16.sh
bash scripts/train_SAR_d20.sh
bash scripts/train_SAR_d24.sh
```

## Evaluation

```bash
bash scripts/setup_eval.sh
bash scripts/eval_SAR_d16.sh
bash scripts/eval_SAR_d20.sh
bash scripts/eval_SAR_d24.sh
```

## Acknowledgements

This codebase builds upon **VAR** and **FlexVAR**.

## Citation

```bibtex
@article{zhou2025rethinking,
  title={Rethinking Training Dynamics in Scale-wise Autoregressive Generation},
  author={Zhou, Gengze and Ge, Chongjian and Tan, Hao and Liu, Feng and Hong, Yicong},
  journal={arXiv preprint arXiv:2512.06421},
  year={2025}
}
```