GridAR-3B-T2I

Autoregressive Text-to-Image Model with 2D Grid Tokens (a Controlled Baseline for FlexTok)

Model Details

  • Model Type: Autoregressive Text-to-Image Generation
  • Architecture: Transformer Decoder with Cross-Attention
  • Embedding Dimension: 2304
  • Number of Blocks: 36
  • Number of Heads: 36
  • Image Resolution: 256x256
  • Tokenizer: ZhitongGao/GridAR_256
  • Text Encoder: google/flan-t5-xl
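
As a rough sanity check on the "3B" in the model name, the figures above can be turned into a back-of-the-envelope parameter count. This is a sketch under assumptions this card does not confirm: each block is taken to hold self-attention and cross-attention with four square projections each, plus an MLP with expansion ratio 4, while embeddings, norms, biases, and the output head are ignored. The function name `estimate_decoder_params` is hypothetical.

```python
# Figures taken from the Model Details list.
EMBED_DIM = 2304
NUM_BLOCKS = 36
NUM_HEADS = 36

# Per-head dimension follows directly from the listed values.
head_dim = EMBED_DIM // NUM_HEADS


def estimate_decoder_params(embed_dim, num_blocks, mlp_ratio=4):
    """Hypothetical helper: approximate weight count of the decoder blocks.

    Assumed layout per block (not confirmed by the model card):
      - self-attention:  Q, K, V, output projections -> 4 * d^2
      - cross-attention: Q, K, V, output projections -> 4 * d^2
      - MLP: two linear layers with hidden size mlp_ratio * d -> 2 * mlp_ratio * d^2
    """
    d2 = embed_dim ** 2
    self_attn = 4 * d2
    cross_attn = 4 * d2
    mlp = 2 * mlp_ratio * d2
    return num_blocks * (self_attn + cross_attn + mlp)


total = estimate_decoder_params(EMBED_DIM, NUM_BLOCKS)
print(f"head_dim = {head_dim}")                # head_dim = 64
print(f"approx. params = {total / 1e9:.2f}B")  # approx. params = 3.06B
```

Under these assumptions the estimate lands near 3B, which is consistent with the model name; the true count depends on details not listed here.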

Installation

For installation instructions, see https://github.com/EPFL-VILAB/search-over-tokens/.

Usage

# Generate image from text prompt
from flextok_ar.utils.helpers import load_model

# Load model
model, tokenizer, cfg = load_model(
    model_id="ZhitongGao/GridAR-3B-T2I",
    device="cuda"
)

# Generate image
images = model.generate(
    data_dict={"text": ["A serene lake at sunset"]},
    cfg_factor=3.0,
    temperature=1.0,
)
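
The `cfg_factor` and `temperature` arguments are not documented in this card. The sketch below shows how classifier-free guidance and temperature are commonly applied to next-token logits in autoregressive image models. It illustrates that general technique under the assumption that `cfg_factor` is a guidance scale; it is not the repository's implementation, and the helper `guided_probs` is hypothetical.

```python
import math


def guided_probs(cond_logits, uncond_logits, cfg_factor=3.0, temperature=1.0):
    """Hypothetical helper: guidance-adjusted sampling distribution for one token.

    cond_logits:   logits computed with the text prompt
    uncond_logits: logits computed with an empty (null) prompt
    """
    # Classifier-free guidance: start from the unconditional logits and move
    # toward the conditional ones; cfg_factor=1.0 recovers the conditional logits.
    guided = [u + cfg_factor * (c - u) for c, u in zip(cond_logits, uncond_logits)]

    # Temperature rescales the logits: values below 1.0 sharpen the
    # distribution, values above 1.0 flatten it.
    scaled = [g / temperature for g in guided]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


# Toy two-token vocabulary: the prompt favors token 0.
cond = [2.0, 0.0]
uncond = [1.0, 0.0]
p_plain = guided_probs(cond, uncond, cfg_factor=1.0)
p_guided = guided_probs(cond, uncond, cfg_factor=3.0)
print(p_plain[0] < p_guided[0])  # True: guidance strengthens the prompt-favored token
```

In this toy case a larger guidance scale concentrates probability on the token the prompt favors, which is the usual trade-off between prompt adherence and sample diversity.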

Citation

If you find this repository helpful, please consider citing our work:

@article{gao2026ordered,
  title={(1D) Ordered Tokens Enable Efficient Test-Time Search},
  author={Zhitong Gao and Parham Rezaei and Ali Cy and Mingqiao Ye and Nata{\v{s}}a Jovanovi{\'{c}} and Jesse Allardice and Afshin Dehghan and Amir Zamir and Roman Bachmann and O{\u{g}}uzhan Fatih Kar},
  journal={arXiv 2026},
  year={2026}
}

@article{flextok,
  title={{FlexTok}: Resampling Images into 1D Token Sequences of Flexible Length},
  author={Roman Bachmann and Jesse Allardice and David Mizrahi and Enrico Fini and O{\u{g}}uzhan Fatih Kar and Elmira Amirloo and Alaaeldin El-Nouby and Amir Zamir and Afshin Dehghan},
  journal={arXiv 2025},
  year={2025}
}