metadata
license: apache-2.0
tags:
- text-to-image
- image-generation
- flextok
- autoregressive
GridAR-3B-T2I
Autoregressive Text-to-Image Model with 2D Grid Tokens (a Controlled Baseline for FlexTok)
Model Details
- Model Type: Autoregressive Text-to-Image Generation
- Architecture: Transformer Decoder with Cross-Attention
- Embedding Dimension: 2304
- Number of Blocks: 36
- Number of Heads: 36
- Image Resolution: 256x256
- Tokenizer: ZhitongGao/GridAR_256
- Text Encoder: google/flan-t5-xl
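As a rough sanity check on the "3B" in the model name, the figures above can be turned into a back-of-the-envelope parameter count. The sketch below assumes every block contains self-attention, cross-attention, and a two-layer MLP with 4x expansion, and it ignores biases, norms, and embedding tables. None of these details is stated in this card, and `estimate_decoder_params` is a hypothetical helper, not part of the repository.

```python
def estimate_decoder_params(dim: int, blocks: int, mlp_ratio: int = 4) -> int:
    """Rough weight count for a decoder stack with cross-attention.

    Assumptions (not confirmed by the model card): every block has
    self-attention, cross-attention, and a two-layer MLP; biases, norms,
    and embedding tables are ignored.
    """
    attn = 4 * dim * dim                 # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * dim * dim      # up- and down-projection
    per_block = 2 * attn + mlp           # self-attn + cross-attn + MLP
    return blocks * per_block


dim, blocks, heads = 2304, 36, 36        # values from Model Details
head_dim = dim // heads                  # dimensions per attention head
total = estimate_decoder_params(dim, blocks)
print(f"head_dim={head_dim}, approx params={total / 1e9:.2f}B")
```

Under these assumptions the estimate is roughly 3.06B weights with 64 dimensions per head, which is in line with the model name. The true count may differ, for example if the cross-attention projections read text features at a different width.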
Installation
For installation instructions, see https://github.com/EPFL-VILAB/search-over-tokens/.
Usage
# Generate an image from a text prompt
from flextok_ar.utils.helpers import load_model

# Load model
model, tokenizer, cfg = load_model(
    model_id="ZhitongGao/GridAR-3B-T2I",
    device="cuda",
)

# Generate image
images = model.generate(
    data_dict={"text": ["A serene lake at sunset"]},
    cfg_factor=3.0,
    temperature=1.0,
)
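In autoregressive samplers, arguments named like `cfg_factor` and `temperature` usually work as follows. Classifier-free guidance mixes the logits of a text-conditioned pass and an unconditioned pass, and temperature rescales the logits before a token is sampled. The sketch below illustrates that common formulation in plain Python with made-up logits. It is an assumption about what these parameters mean here, not the repository's implementation, and `guided_logits` and `sample_token` are hypothetical helpers.

```python
import math
import random


def guided_logits(cond, uncond, cfg_factor):
    """Common CFG formulation: uncond + cfg_factor * (cond - uncond)."""
    return [u + cfg_factor * (c - u) for c, u in zip(cond, uncond)]


def sample_token(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                          # subtract max for stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Toy logits over a 4-token vocabulary (illustrative numbers only).
cond = [2.0, 0.5, 0.1, -1.0]      # pass conditioned on the text prompt
uncond = [1.0, 0.5, 0.3, -0.5]    # pass with the prompt dropped
logits = guided_logits(cond, uncond, cfg_factor=3.0)
token = sample_token(logits, temperature=1.0, rng=random.Random(0))
```

With `cfg_factor=1.0` this formulation reduces to the conditional logits. Larger values push sampling toward tokens the prompt favors, and a lower temperature makes sampling more deterministic.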
Citation
If you find this repository helpful, please consider citing our work:
@article{gao2026ordered,
  title={(1D) Ordered Tokens Enable Efficient Test-Time Search},
  author={Zhitong Gao and Parham Rezaei and Ali Cy and Mingqiao Ye and Nata{\v{s}}a Jovanovi{\'{c}} and Jesse Allardice and Afshin Dehghan and Amir Zamir and Roman Bachmann and O{\u{g}}uzhan Fatih Kar},
  journal={arXiv 2026},
  year={2026}
}
@article{flextok,
  title={{FlexTok}: Resampling Images into 1D Token Sequences of Flexible Length},
  author={Roman Bachmann and Jesse Allardice and David Mizrahi and Enrico Fini and O{\u{g}}uzhan Fatih Kar and Elmira Amirloo and Alaaeldin El-Nouby and Amir Zamir and Afshin Dehghan},
  journal={arXiv 2025},
  year={2025}
}