Improve model card: add pipeline tag, paper info, code link, and sample usage
This PR enhances the model card by:
- Adding the `pipeline_tag: text-to-image` to correctly categorize the model for discovery on the Hugging Face Hub.
- Including the paper title and a link to its official Hugging Face paper page.
- Providing the full abstract from the paper for comprehensive understanding.
- Adding a direct link to the official GitHub repository for easy access to the code.
- Incorporating a detailed sample usage section, including installation instructions and inference commands, directly from the official GitHub repository's README to guide users on how to run the model.
- Adding a visual representation of the model from the GitHub README.
README.md (changed):

---
license: mit
pipeline_tag: text-to-image
---

# Go with Your Gut: Scaling Confidence for Autoregressive Image Generation

This repository contains the official implementation of the paper [Go with Your Gut: Scaling Confidence for Autoregressive Image Generation](https://huggingface.co/papers/2509.26376).
| 9 |
+
|
| 10 |
+
## Abstract
|
| 11 |
+
|
| 12 |
+
Test-time scaling (TTS) has demonstrated remarkable success in enhancing large language models, yet its application to next-token prediction (NTP) autoregressive (AR) image generation remains largely uncharted. Existing TTS approaches for visual AR (VAR), which rely on frequent partial decoding and external reward models, are ill-suited for NTP-based image generation due to the inherent incompleteness of intermediate decoding results. To bridge this gap, we introduce ScalingAR, the first TTS framework specifically designed for NTP-based AR image generation that eliminates the need for early decoding or auxiliary rewards. ScalingAR leverages token entropy as a novel signal in visual token generation and operates at two complementary scaling levels: (i) Profile Level, which streams a calibrated confidence state by fusing intrinsic and conditional signals; and (ii) Policy Level, which utilizes this state to adaptively terminate low-confidence trajectories and dynamically schedule guidance for phase-appropriate conditioning strength. Experiments on both general and compositional benchmarks show that ScalingAR (1) improves base models by 12.5% on GenEval and 15.2% on TIIF-Bench, (2) efficiently reduces visual token consumption by 62.0% while outperforming baselines, and (3) successfully enhances robustness, mitigating performance drops by 26.0% in challenging scenarios.
|
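To make the Profile Level idea above concrete, here is a minimal sketch of computing per-token entropy from next-token logits and folding it into a running confidence state. This is an illustration only: `fuse_confidence` and its exponential-moving-average smoothing are assumptions for exposition, not the paper's actual calibration.

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy of the next-token distribution implied by `logits`."""
    z = logits - logits.max()          # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()    # softmax
    return float(-(p * np.log(p + 1e-12)).sum())

def fuse_confidence(state: float, entropy: float, beta: float = 0.9) -> float:
    """Hypothetical confidence-state update: an exponential moving average
    of negative entropy (a higher state means a more confident model)."""
    return beta * state + (1 - beta) * (-entropy)

peaked = np.array([10.0, 0.0, 0.0, 0.0])   # confident next-token prediction
flat = np.array([1.0, 1.0, 1.0, 1.0])      # uncertain next-token prediction
assert token_entropy(peaked) < token_entropy(flat)
```

A peaked distribution yields low entropy (high confidence), a near-uniform one yields entropy close to `log(vocab_size)`; the streamed state can then drive policy decisions downstream.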
<div align="center">
<img src="https://github.com/EnVision-Research/ScalingAR/raw/main/asset/scalingar.png" alt="ScalingAR overview image">
</div>

**Code:** [https://github.com/EnVision-Research/ScalingAR](https://github.com/EnVision-Research/ScalingAR)
| 19 |
+
|
| 20 |
+
## Sample Usage
|
| 21 |
+
|
| 22 |
+
### Installation
|
| 23 |
+
|
| 24 |
+
1. Clone this repository and navigate to the source folder
|
| 25 |
+
```bash
|
| 26 |
+
git clone https://github.com/EnVision-Research/ScalingAR
|
| 27 |
+
cd ScalingAR
|
| 28 |
+
```
|
| 29 |
+
|
| 30 |
+
2. Build Environment
|
| 31 |
+
|
| 32 |
+
```Shell
|
| 33 |
+
echo "Creating conda environment"
|
| 34 |
+
conda create -n ScalingAR python=3.10
|
| 35 |
+
conda activate ScalingAR
|
| 36 |
+
|
| 37 |
+
echo "Installing dependencies"
|
| 38 |
+
pip install -r requirements.txt
|
| 39 |
+
```
|
| 40 |
+
|
| 41 |
+
### Inference
|
| 42 |
+
|
| 43 |
+
**LlamaGen**
|
| 44 |
+
|
| 45 |
+
```bash
|
| 46 |
+
PYTHONPATH=. python llamagen/sample_entropy.py --vq-ckpt ${VQ_CKPT} --gpt-ckpt ${LlamaGen_CKPT} --gpt-model GPT-XL --t5-path ${T5_PATH} --image-size 512
|
| 47 |
+
```
|
| 48 |
+
|
| 49 |
+
**AR-GRPO**
|
| 50 |
+
|
| 51 |
+
```bash
|
| 52 |
+
PYTHONPATH=. python AR_GRPO/sample_entropy.py --ckpt-path ${AR-GRPO_CKPT} --t5-path ${T5_PATH} --delay_load_text_encoder True --image-size 256
|
| 53 |
+
```
|
| 54 |
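The scripts above sample with entropy-aware scaling. To convey the Policy Level behavior of terminating low-confidence trajectories early (which is how token consumption is reduced), here is a hypothetical sketch; the trajectory scores, the threshold, and the best-of-N selection are illustrative assumptions, not the released code.

```python
def sample_with_pruning(step_confidences, threshold=-1.0):
    """Consume per-step confidence scores; terminate the trajectory
    as soon as confidence drops below `threshold`."""
    consumed = 0
    for c in step_confidences:
        consumed += 1
        if c < threshold:
            return None, consumed      # pruned early: remaining tokens saved
    return step_confidences, consumed  # survived to completion

def best_of_n(trajectories, threshold=-1.0):
    """Among trajectories that survive pruning, pick the one with the
    highest mean confidence; also report total tokens consumed."""
    best, best_score, consumed = None, float("-inf"), 0
    for i, traj in enumerate(trajectories):
        kept, used = sample_with_pruning(traj, threshold)
        consumed += used
        if kept is not None:
            score = sum(kept) / len(kept)
            if score > best_score:
                best, best_score = i, score
    return best, consumed

# Trajectory 1 dips below the threshold and is cut after 2 of its 3 steps.
trajectories = [[-0.2, -0.3, -0.4], [-0.1, -2.0, -0.1], [-0.5, -0.6, -0.7]]
print(best_of_n(trajectories))  # -> (0, 8): trajectory 0 wins, 8 of 9 tokens used
```

Pruning a weak trajectory mid-generation is what lets a best-of-N search spend fewer visual tokens than decoding every candidate to the end.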
+
|
| 55 |
+
## Citation
|
| 56 |
+
|
| 57 |
+
Please consider citing our paper if our code is useful:
|
| 58 |
+
|
| 59 |
+
```bib
|
| 60 |
+
@article{chen2025go,
|
| 61 |
+
title={Go with Your Gut: Scaling Confidence for Autoregressive Image Generation},
|
| 62 |
+
author={Chen, Harold Haodong and Wu, Xianfeng and Shu, Wen-Jie and Guo, Rongjin and Lan, Disen and Yang, Harry and Chen, Ying-Cong},
|
| 63 |
+
journal={arXiv preprint arXiv:2509.26376},
|
| 64 |
+
year={2025}
|
| 65 |
+
}
|
| 66 |
+
```
|