---
datasets:
- ILSVRC/imagenet-1k
- ljnlonoljpiljm/places365-256px
language:
- en
- zh
license: mit
pipeline_tag: class-conditional-image-generation
library_name: pytorch
---

[![arXiv](https://img.shields.io/badge/arXiv%20paper-2503.18948-b31b1b.svg)](https://arxiv.org/abs/2503.18948)

This is the official model card for the paper [Equivariant Image Modeling](https://arxiv.org/abs/2503.18948).

In this paper, we propose a novel equivariant image modeling framework that inherently aligns optimization targets across subtasks in autoregressive image modeling by leveraging the translation invariance of natural visual signals. Our method introduces:

* Column-wise tokenization, which enhances translational symmetry along the horizontal axis.
* Autoregressive generative models using windowed causal attention, which enforces consistent contextual relationships across positions.

Evaluated on class-conditioned ImageNet generation at 256×256 resolution, our approach achieves performance comparable to state-of-the-art AR models while using fewer computational resources. Moreover, it significantly improves zero-shot generalization and enables ultra-long image synthesis.

## Bibtex

```bibtex
@misc{dong2025equivariantimagemodeling,
      title={Equivariant Image Modeling},
      author={Ruixiao Dong and Mengde Xu and Zigang Geng and Li Li and Han Hu and Shuyang Gu},
      year={2025},
      eprint={2503.18948},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.18948},
}
```
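
## Sketch: windowed causal attention mask

To make the idea of windowed causal attention more concrete, here is a minimal, dependency-free sketch of the attention mask it implies over a 1D token sequence. It is an illustration only, not the released implementation: the function names `windowed_causal_mask` and `relative_context`, the window size of 3, and the convention that the window includes the current token are all assumptions made for this example.

```python
def windowed_causal_mask(num_tokens: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumed convention (not taken from the paper's code): token i sees itself
    and at most `window - 1` preceding tokens, and never any later token.
    """
    if num_tokens < 0 or window < 1:
        raise ValueError("num_tokens must be >= 0 and window must be >= 1")
    return [
        [i - window < j <= i for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


def relative_context(mask: list[list[bool]], i: int) -> list[int]:
    """Offsets (j - i) of the key positions visible to query position i."""
    return [j - i for j, allowed in enumerate(mask[i]) if allowed]


# Hypothetical sizes chosen only for illustration.
mask = windowed_causal_mask(num_tokens=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Once a position has at least `window - 1` predecessors, every query sees the same set of relative offsets (here `[-2, -1, 0]`), which is one way to read "consistent contextual relationships across positions". In a PyTorch model, a boolean mask of this shape could be converted to a tensor and passed to the attention layer, though how the paper's models apply it is not specified here.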