---
pipeline_tag: text-to-image
---

# SphereAR: Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

This repository contains the official PyTorch implementation of the paper [Hyperspherical Latents Improve Continuous-Token Autoregressive Generation](https://huggingface.co/papers/2509.24335).

<p align="center">
  <img src="https://github.com/guolinke/SphereAR/raw/main/figures/grid.jpg" width=780>
</p>

## Introduction

<p align="center"><img src="https://github.com/guolinke/SphereAR/raw/main/figures/overview.png" width=553><img src="https://github.com/guolinke/SphereAR/raw/main/figures/fid_vs_params.png" width=246></p>

SphereAR is a simple yet effective approach to continuous-token autoregressive (AR) image generation. It makes AR modeling scale-invariant by constraining all AR inputs and outputs, **including those produced after CFG**, to lie on a fixed-radius hypersphere (constant $\ell_2$ norm) via hyperspherical VAEs. The paper's theoretical analysis shows that this hyperspherical constraint removes the scale component of the latents, which is the primary cause of variance collapse, thereby stabilizing AR decoding.
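The core constraint can be illustrated with a minimal sketch. This is not the paper's implementation; the latent dimension, the unit radius, and the helper names below are illustrative assumptions. Each latent is projected onto a fixed-radius hypersphere, and the CFG-combined prediction is re-projected onto the same sphere before it feeds the next AR step.

```python
import torch

def project_to_sphere(z: torch.Tensor, radius: float = 1.0) -> torch.Tensor:
    """Project latent vectors onto a hypersphere of fixed radius (constant L2 norm)."""
    return radius * z / z.norm(dim=-1, keepdim=True).clamp_min(1e-8)

def guided_latent(z_cond: torch.Tensor, z_uncond: torch.Tensor,
                  cfg_scale: float, radius: float = 1.0) -> torch.Tensor:
    """Combine conditional/unconditional predictions via CFG, then re-project
    so the guided latent also lies on the sphere."""
    z = z_uncond + cfg_scale * (z_cond - z_uncond)
    return project_to_sphere(z, radius)

# Toy 16-dim latents (dimension is an arbitrary choice here, not from the paper).
z_c = project_to_sphere(torch.randn(4, 16))
z_u = project_to_sphere(torch.randn(4, 16))
z = guided_latent(z_c, z_u, cfg_scale=4.5)
print(z.norm(dim=-1))  # each norm is 1.0 (up to float precision)
```

Note that re-projecting after the guidance combination is the step that keeps the guided latent on the sphere: a plain CFG combination of two on-sphere vectors generally leaves it.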

The model is a **pure next-token** AR generator with **raster** order, matching standard language AR modeling. On ImageNet 256×256, SphereAR-H (943M) achieves a state-of-the-art FID of **1.34** among AR image generators. Even at smaller scales, SphereAR-L (479M) reaches FID 1.54 and SphereAR-B (208M) reaches 1.92, matching or surpassing much larger baselines.
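As a rough illustration of raster-order next-token prediction, the 2-D latent grid is flattened row by row, so the AR model consumes it exactly like a language-model token sequence. The grid size and token dimension below are illustrative assumptions, not values taken from the paper:

```python
import torch

# A toy 2-D latent grid: H x W positions, each a d-dim continuous token.
H, W, d = 16, 16, 8            # illustrative sizes only
grid = torch.randn(H, W, d)

# Raster order: left-to-right within a row, rows top-to-bottom
# (row-major flattening), just like a text token sequence.
tokens = grid.reshape(H * W, d)

# Token i corresponds to grid position (i // W, i % W).
assert torch.equal(tokens[W + 3], grid[1, 3])
print(tokens.shape)  # torch.Size([256, 8])
```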

For more details on the implementation, environment setup, and advanced usage, please refer to the [official GitHub repository](https://github.com/guolinke/SphereAR).

## Model Checkpoints

Pre-trained model checkpoints are available on Hugging Face:

| Name | params | FID (256×256) | weight |
| :--------- | :----: | :-----------: | :------------------------------------------------------------------------ |
| S-VAE | 75M | - | [vae.pt](https://huggingface.co/guolinke/SphereAR/blob/main/vae.pt) |
| SphereAR-B | 208M | 1.92 | [SphereAR_B.pt](https://huggingface.co/guolinke/SphereAR/blob/main/SphereAR_B.pt) |
| SphereAR-L | 479M | 1.54 | [SphereAR_L.pt](https://huggingface.co/guolinke/SphereAR/blob/main/SphereAR_L.pt) |
| SphereAR-H | 943M | 1.34 | [SphereAR_H.pt](https://huggingface.co/guolinke/SphereAR/blob/main/SphereAR_H.pt) |

## Sample Usage: Class-conditional Image Generation

To sample 50,000 images using the `SphereAR-H` checkpoint for evaluation, you can use the following command, adapted from the official repository. It requires a distributed setup (`torchrun`).

```shell
# First, download the SphereAR-H checkpoint (SphereAR_H.pt) and the S-VAE checkpoint (vae.pt)
# from the links in the "Model Checkpoints" table above.

ckpt=your_path_to/SphereAR_H.pt     # Path to your downloaded SphereAR_H.pt checkpoint
result_path=your_result_directory   # Directory to save generated images

torchrun --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  sample_ddp.py --model SphereAR-H --ckpt $ckpt --cfg-scale 4.5 \
  --sample-dir $result_path --per-proc-batch-size 256 --to-npz
```

*Note: The `sample_ddp.py` script and its dependencies can be found in the [official GitHub repository](https://github.com/guolinke/SphereAR). Ensure your environment is set up according to its instructions, including PyTorch and FlashAttention.*

## Citation

If you find our work useful, please consider citing the paper:

```bibtex
@article{ke2025hyperspherical,
  title={Hyperspherical Latents Improve Continuous-Token Autoregressive Generation},
  author={Guolin Ke and Hui Xue},
  journal={arXiv preprint arXiv:2509.24335},
  year={2025}
}
```