GridAR-3B-T2I

Autoregressive Text-to-Image Model with 2D Grid Tokens (a Controlled Baseline for FlexTok)

Model Details

Model Type: Autoregressive Text-to-Image Generation
Architecture: Transformer Decoder with Cross-Attention
Embedding Dimension: 2304
Number of Blocks: 36
Number of Heads: 36
Image Resolution: 256x256
Tokenizer: ZhitongGao/GridAR_256
Text Encoder: google/flan-t5-xl
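
As a quick sanity check on the numbers above, the sketch below derives the per-head dimension and a rough decoder parameter count. The layer layout it uses is an assumption and is not stated in this card: each block is taken to hold one self-attention layer, one cross-attention layer, and a non-gated MLP with 4x expansion, and biases, norms, embeddings, and the frozen text encoder are ignored. The real model may differ.

```python
# Values taken from the Model Details list.
embed_dim = 2304
num_heads = 36
num_blocks = 36

# Per-head dimension: the embedding is split evenly across heads.
head_dim = embed_dim // num_heads

# ASSUMPTION (not from the model card): each attention layer has four
# d x d projections (Q, K, V, output), and the MLP is d -> 4d -> d.
attn_params = 4 * embed_dim ** 2
mlp_params = 2 * embed_dim * (4 * embed_dim)

# ASSUMPTION: one self-attention, one cross-attention, one MLP per block.
params_per_block = 2 * attn_params + mlp_params
approx_total = num_blocks * params_per_block

print(f"head_dim = {head_dim}")
print(f"approx decoder params = {approx_total / 1e9:.2f}B")
```

Under these assumptions the head dimension is 64 and the estimate comes to roughly 3.06 billion parameters, which is in line with the "3B" in the model name.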

Installation

For installation instructions, see https://github.com/EPFL-VILAB/search-over-tokens/.

Usage

# Generate image from text prompt
from flextok_ar.utils.helpers import load_model

# Load model
model, tokenizer, cfg = load_model(
    model_id="ZhitongGao/GridAR-3B-T2I",
    device="cuda"
)

# Generate image
images = model.generate(
    data_dict={"text": ["A serene lake at sunset"]},
    cfg_factor=3.0,
    temperature=1.0,
)
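
The `cfg_factor` and `temperature` arguments above are commonly understood as a classifier-free guidance scale and a sampling temperature. The sketch below illustrates how these two knobs typically act on next-token logits in autoregressive sampling. It is a generic, standard-library illustration, not the `flextok_ar` implementation: the function names `guided_logits`, `softmax_with_temperature`, and `sample_token` are hypothetical, and the guidance formula is the usual one, assumed here rather than confirmed by the repository.

```python
import math
import random


def guided_logits(cond, uncond, cfg_factor):
    """Classifier-free guidance (assumed standard form):
    push the conditional logits away from the unconditional ones."""
    return [u + cfg_factor * (c - u) for c, u in zip(cond, uncond)]


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(cond, uncond, cfg_factor=3.0, temperature=1.0, rng=random):
    """Sample one token index from the guided, temperature-scaled distribution."""
    logits = guided_logits(cond, uncond, cfg_factor)
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy logits over a 3-token vocabulary (made-up numbers for illustration).
cond_logits = [2.0, 0.5, -1.0]    # with the text prompt
uncond_logits = [1.0, 0.5, 0.0]   # with an empty prompt
token = sample_token(cond_logits, uncond_logits, cfg_factor=3.0, temperature=1.0)
```

With `cfg_factor=1.0` the guided logits reduce to the conditional logits, so guidance is effectively off; larger values weight the prompt more heavily, usually at some cost in sample diversity.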

Citation

If you find this repository helpful, please consider citing our work:

@article{gao2026ordered,
  title={(1D) Ordered Tokens Enable Efficient Test-Time Search},
  author={Zhitong Gao and Parham Rezaei and Ali Cy and Mingqiao Ye and Nata{\v{s}}a Jovanovi{\'{c}} and Jesse Allardice and Afshin Dehghan and Amir Zamir and Roman Bachmann and O{\u{g}}uzhan Fatih Kar},
  journal={arXiv 2026},
  year={2026}
}

@article{flextok,
  title={{FlexTok}: Resampling Images into 1D Token Sequences of Flexible Length},
  author={Roman Bachmann and Jesse Allardice and David Mizrahi and Enrico Fini and O{\u{g}}uzhan Fatih Kar and Elmira Amirloo and Alaaeldin El-Nouby and Amir Zamir and Afshin Dehghan},
  journal={arXiv 2025},
  year={2025}
}
