FlexAR-1B-C2I

An autoregressive class-to-image model that generates images as FlexTok token sequences.

Model Details

  • Model Type: Autoregressive Class-to-Image Generation
  • Architecture: Transformer Decoder (Class-Conditional)
  • Embedding Dimension: 1920
  • Number of Blocks: 30
  • Number of Heads: 30
  • Number of Classes: 1000 (ImageNet)
  • Image Resolution: 256x256
  • Tokenizer: EPFL-VILAB/flextok_d18_d28_in1k
  • Text Encoder: N/A (Class-conditional, no text encoder)
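The spec above pins down the backbone size. As a sanity check, a rough parameter count for a standard pre-norm transformer decoder with these dimensions (assuming a 4x MLP expansion, which is an assumption about the architecture, not confirmed by the card) lands close to the "1B" in the model name:

```python
# Rough parameter estimate for a 30-block, 1920-dim transformer decoder.
# The 4x MLP expansion ratio is an assumption; the exact FlexAR block
# layout may differ slightly (norms, biases, embeddings are ignored).
d = 1920       # embedding dimension
n_blocks = 30  # number of transformer blocks

attn = 4 * d * d         # Q, K, V, and output projections
mlp = 2 * d * (4 * d)    # two linear layers with 4x expansion
per_block = attn + mlp   # ~12 * d^2 per block

total = n_blocks * per_block
print(f"~{total / 1e9:.2f}B parameters")  # ~1.33B
```

At roughly 1.33B parameters for the blocks alone, this is consistent with the FlexAR-1B naming.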

Installation

For install instructions, please see https://github.com/EPFL-VILAB/search-over-tokens/.

Usage

# Generate image from ImageNet class label
from flextok_ar.utils.helpers import load_model

# Load model
model, tokenizer, cfg = load_model(
    model_id="EPFL-VILAB/FlexAR-1B-C2I",
    device="cuda"
)

# Generate image from class label (0-999 for ImageNet)
# Example: 285 = Egyptian cat
images = model.generate(
    data_dict={"target": [285]},
    cfg_factor=1.5,
    temperature=1.0,
)
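The cfg_factor argument controls classifier-free guidance strength. A minimal sketch of how CFG typically combines the conditional and unconditional logits at each sampling step (an illustration of the general technique, not the exact FlexAR implementation):

```python
import numpy as np

def cfg_logits(cond, uncond, cfg_factor):
    # Classifier-free guidance: push the prediction away from the
    # unconditional logits, toward the class-conditional ones.
    # cfg_factor = 1.0 recovers the plain conditional logits;
    # larger values trade diversity for class fidelity.
    return uncond + cfg_factor * (cond - uncond)

cond = np.array([2.0, 0.5, -1.0])    # logits given the class label
uncond = np.array([1.0, 1.0, 0.0])   # logits with the label dropped
print(cfg_logits(cond, uncond, 1.5))  # [ 2.5   0.25 -1.5 ]
```

With cfg_factor=1.5 as in the usage example above, the guided logits overshoot the conditional ones slightly, sharpening the class signal during sampling.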

Citation

If you find this repository helpful, please consider citing our work:


@article{gao2026ordered,
  title={(1D) Ordered Tokens Enable Efficient Test-Time Search},
  author={Zhitong Gao and Parham Rezaei and Ali Cy and Mingqiao Ye and Nata{\v{s}}a Jovanovi{\'{c}} and Jesse Allardice and Afshin Dehghan and Amir Zamir and Roman Bachmann and O{\u{g}}uzhan Fatih Kar},
  journal={arXiv 2026},
  year={2026}
}

@article{flextok,
  title={{FlexTok}: Resampling Images into 1D Token Sequences of Flexible Length},
  author={Roman Bachmann and Jesse Allardice and David Mizrahi and Enrico Fini and O{\u{g}}uzhan Fatih Kar and Elmira Amirloo and Alaaeldin El-Nouby and Amir Zamir and Afshin Dehghan},
  journal={arXiv 2025},
  year={2025},
}