---
pipeline_tag: text-to-image
library_name: pytorch
---

# SphereAR: Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

This repository contains the official PyTorch implementation of the paper *Hyperspherical Latents Improve Continuous-Token Autoregressive Generation*.

The official code and further details are available in the GitHub repository: https://github.com/guolinke/SphereAR

## Abstract

Autoregressive (AR) models are promising for image generation, yet continuous-token AR variants often trail latent diffusion and masked-generation models. The core issue is heterogeneous variance in VAE latents, which is amplified during AR decoding, especially under classifier-free guidance (CFG), and can cause variance collapse. We propose SphereAR to address this issue. Its core design is to constrain all AR inputs and outputs (including after CFG) to lie on a fixed-radius hypersphere (constant $\ell_2$ norm), leveraging hyperspherical VAEs. Our theoretical analysis shows that the hyperspherical constraint removes the scale component (the primary cause of variance collapse), thereby stabilizing AR decoding. Empirically, on ImageNet generation, SphereAR-H (943M) sets a new state of the art for AR models, achieving FID 1.34. Even at smaller scales, SphereAR-L (479M) reaches FID 1.54 and SphereAR-B (208M) reaches 1.92, matching or surpassing much larger baselines such as MAR-H (943M, 1.55) and VAR-d30 (2B, 1.92). To our knowledge, this is the first time a pure next-token AR image generator with raster order surpasses diffusion and masked-generation models at comparable parameter scales.

## Introduction

SphereAR is a simple yet effective approach to continuous-token autoregressive (AR) image generation: it makes AR decoding scale-invariant by constraining all AR inputs and outputs, including after CFG, to lie on a fixed-radius hypersphere (constant $\ell_2$ norm) via hyperspherical VAEs.
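The constraint above can be sketched in a few lines. This is not the repository's actual API; `project_to_sphere` and `cfg_on_sphere` are hypothetical helper names, and the radius of 1.0 is an assumption. The key point is that the guided latent is re-projected onto the sphere, so CFG cannot change its norm:

```python
import torch


def project_to_sphere(z: torch.Tensor, radius: float = 1.0, dim: int = -1) -> torch.Tensor:
    """Project latent vectors onto a hypersphere of fixed radius (constant L2 norm)."""
    return radius * z / z.norm(dim=dim, keepdim=True).clamp_min(1e-8)


def cfg_on_sphere(z_cond: torch.Tensor, z_uncond: torch.Tensor,
                  scale: float, radius: float = 1.0) -> torch.Tensor:
    """Classifier-free guidance followed by re-projection onto the sphere.

    The linear CFG combination generally leaves the sphere, so the result is
    normalized back to the fixed radius, removing the scale component that
    drives variance collapse.
    """
    z = z_uncond + scale * (z_cond - z_uncond)
    return project_to_sphere(z, radius, dim=-1)
```

Because the output norm is fixed, only the direction of the latent carries information, which is what makes the decoding loop scale-invariant.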

The model is a pure next-token AR generator with raster order, matching standard language AR modeling (i.e., it is not next-scale AR like VAR and not next-set AR like MAR/MaskGIT).
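A minimal sketch of such a raster-order next-token decoding loop, assuming a hypothetical `model` callable that maps the tokens generated so far to the next continuous token (the real sampling code lives in the GitHub repository):

```python
import torch


@torch.no_grad()
def generate_raster(model, num_tokens: int, dim: int, radius: float = 1.0) -> torch.Tensor:
    """Raster-order next-token AR decoding sketch.

    Tokens are produced one at a time, left-to-right, top-to-bottom; each
    predicted continuous token is projected back onto the hypersphere before
    being appended to the context.
    """
    tokens = torch.empty(0, dim)
    for _ in range(num_tokens):
        z = model(tokens)  # hypothetical interface: context -> next token
        z = radius * z / z.norm().clamp_min(1e-8)  # keep the token on the sphere
        tokens = torch.cat([tokens, z[None]], dim=0)
    return tokens  # (num_tokens, dim), later decoded to pixels by the VAE decoder
```

This mirrors standard language-model decoding, except each "token" is a continuous latent vector rather than a discrete index.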

On ImageNet 256×256, SphereAR achieves a state-of-the-art FID of 1.34 among AR image generators.

## Model Checkpoints

The following pre-trained models are available for class-conditional image generation on ImageNet:

| Name | Params | FID (256×256) | Weights |
|---|---|---|---|
| S-VAE | 75M | - | `vae.pt` |
| SphereAR-B | 208M | 1.92 | `SphereAR_B.pt` |
| SphereAR-L | 479M | 1.54 | `SphereAR_L.pt` |
| SphereAR-H | 943M | 1.34 | `SphereAR_H.pt` |

For detailed instructions on evaluation and training using these checkpoints, please refer to the official GitHub repository.

## Citation

If you find this work useful, please consider citing the paper:

```bibtex
@article{ke2025hyperspherical,
  title={Hyperspherical Latents Improve Continuous-Token Autoregressive Generation},
  author={Guolin Ke and Hui Xue},
  journal={arXiv preprint arXiv:2509.24335},
  year={2025}
}
```