---
license: mit
library_name: miro-t2i
tags:
- text-to-image
- diffusion
- flow-matching
- miro
- reward-conditioning
pipeline_tag: text-to-image
---

# MIRO (main)

<sub>Qualitative samples from the released MIRO checkpoint (same gallery as the
teaser of the [project page](https://nicolas-dufour.github.io/miro/)).</sub>

**Main MIRO checkpoint.** Trained jointly on all seven reward signals (CLIP, aesthetic, ImageReward, PickScore, HPSv2, VQAScore, SciScore) with a 50/50 mix of original and synthetic captions.

This checkpoint accompanies the paper
**MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency**
(Dufour, Degeorge, Ghosh, Kalogeiton, Picard; ICML 2026).

| | |
|---|---|
| **Paper** | <https://arxiv.org/abs/2510.25897> |
| **Project page** | <https://nicolas-dufour.github.io/miro/> |
| **Code** | <https://github.com/nicolas-dufour/miro> |
| **Demo** | [🤗 Space](https://huggingface.co/spaces/nicolas-dufour/miro) (Gradio + ZeroGPU) |
| **Parameters** | 360.4M |
| **Resolution** | 256×256 (SDXL VAE latent space) |
| **Architecture** | RIN flow-matching backbone, FLAN-T5-XL text conditioning |
| **Training data** | [CC12M](https://huggingface.co/datasets/pixparse/cc12m-wds) + [LAION Aesthetics v2 4.5](https://huggingface.co/datasets/laion/aesthetics_v2_4.5) (6.0+ aesthetic subset) |
| **Reward signals** | `clip_score`, `aesthetic_score`, `image_reward_score`, `pick_a_score_score`, `hpsv2_score`, `vqa_score`, `sciscore_score` |
| **Weights** | `model.safetensors`, **fp32** (EMA master weights, ready for finetuning) |

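To make the "256×256 (SDXL VAE latent space)" entry concrete, here is a small arithmetic sketch. It relies on the SDXL VAE's 8× spatial downsampling and 4 latent channels; `latent_shape` is a hypothetical helper written for this card, not part of the `miro` package.

```python
# Illustrative arithmetic only; `latent_shape` is a hypothetical helper, not a miro API.
SDXL_VAE_DOWNSAMPLE = 8  # spatial reduction per side of the SDXL VAE
SDXL_VAE_CHANNELS = 4    # number of latent channels of the SDXL VAE


def latent_shape(height: int, width: int) -> tuple:
    """Return (channels, latent_height, latent_width) for a given image size."""
    if height % SDXL_VAE_DOWNSAMPLE or width % SDXL_VAE_DOWNSAMPLE:
        raise ValueError("image sides must be multiples of 8")
    return (
        SDXL_VAE_CHANNELS,
        height // SDXL_VAE_DOWNSAMPLE,
        width // SDXL_VAE_DOWNSAMPLE,
    )


print(latent_shape(256, 256))  # (4, 32, 32)
```

Under these assumptions the flow model denoises a 4×32×32 latent per image, which the VAE then decodes to 256×256 pixels.
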
## Install

```bash
pip install miro-t2i
```

`miro-t2i` is the public PyPI package; the module it installs is imported as
`miro`. The first call to `MiroPipeline.from_pretrained(...)` additionally
downloads [`google/flan-t5-xl`](https://huggingface.co/google/flan-t5-xl) (text encoder)
and [`stabilityai/sdxl-vae`](https://huggingface.co/stabilityai/sdxl-vae)
(latent decoder) from the Hub.

## Usage

```python
import torch
from miro import MiroPipeline

# Load the released checkpoint from the Hub, then move it to the GPU in bf16.
pipe = MiroPipeline.from_pretrained("nicolas-dufour/miro")
pipe = pipe.to("cuda", torch.bfloat16)

prompt = (
    "Photography closeup portrait of an adorable rusty brokendown steampunk "
    "robot covered in budding vegetation, surrounded by tall grass, misty "
    "futuristic scifi forest environment."
)
# Index 0 selects the first generated image.
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.0)[0]
image.save("out.png")
```

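As a rough guide to what the `torch.bfloat16` cast buys, the sketch below estimates the weights-only footprint of the 360.4M-parameter MIRO backbone in fp32 and bf16. It deliberately ignores the FLAN-T5-XL text encoder, the SDXL VAE, and activations, so real GPU memory use will be higher; `weight_memory_gb` is a hypothetical helper.

```python
PARAMS = 360.4e6  # MIRO parameter count stated in this model card
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Weights-only footprint in GB (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp32", "bf16"):
    print(f"{dtype}: {weight_memory_gb(PARAMS, dtype):.2f} GB")
# fp32: 1.44 GB
# bf16: 0.72 GB
```
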
### Reward conditioning

MIRO conditions the flow model on a vector of reward targets in addition to the
text prompt. By default every reward is requested at its maximum (`1.0`); you
can override individual axes to bias generation toward a particular trade-off:

```python
image = pipe(
    prompt,  # the rusty-robot prompt from above
    reward_targets={
        "clip_score": 1.0,          # strict prompt alignment
        "aesthetic_score": 0.3,     # de-prioritise prettiness
        "image_reward_score": 1.0,  # prioritise general human preference
        # any reward not listed defaults to 1.0
    },
    negative_reward_targets={
        # zeros by default; what to push the unconditional branch toward
    },
    guidance_scale=7.0,
)[0]
```

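The `negative_reward_targets` and `guidance_scale` arguments suggest classifier-free-guidance-style sampling, with one branch conditioned on the requested rewards and another on the negative ones. This card does not state the exact formula, so the sketch below only shows the standard guidance combination on toy numbers, assuming MIRO follows it; `guided_velocity` and the toy vectors are hypothetical.

```python
def guided_velocity(v_pos, v_neg, guidance_scale):
    """Standard classifier-free guidance: start at v_neg and extrapolate toward v_pos."""
    return [n + guidance_scale * (p - n) for p, n in zip(v_pos, v_neg)]


v_pos = [1.0, 0.5]  # toy prediction under the requested reward targets
v_neg = [0.0, 0.5]  # toy prediction under the negative reward targets

print(guided_velocity(v_pos, v_neg, 1.0))  # [1.0, 0.5]  (scale 1 returns v_pos)
print(guided_velocity(v_pos, v_neg, 7.0))  # [7.0, 0.5]
```

With this formula, components on which the two branches agree are left unchanged, while components on which they differ are amplified by the guidance scale.
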
The seven reward dimensions are:

| Reward | Normalised range | What it measures |
|---|---|---|
| `clip_score` | ~[0, 1] | CLIP text–image alignment |
| `aesthetic_score` | ~[0, 1] | LAION aesthetic-quality predictor |
| `image_reward_score` | ~[0, 1] | ImageReward (general preference model) |
| `pick_a_score_score` | ~[0, 1] | PickScore (human preference) |
| `hpsv2_score` | ~[0, 1] | HPSv2 (human preference v2) |
| `vqa_score` | ~[0, 1] | VQAScore (compositional faithfulness) |
| `sciscore_score` | ~[0, 1] | SciScore (scientific-image plausibility) |

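Because reward names are plain strings, a typo such as `"pickscore"` is easy to make. The helper below is a minimal sketch of how one might build and validate a full target dictionary before passing it as `reward_targets`. `build_reward_targets` is hypothetical and not part of `miro`, and the strict `[0, 1]` check is an assumption, since the ranges above are only approximate.

```python
REWARD_KEYS = (
    "clip_score",
    "aesthetic_score",
    "image_reward_score",
    "pick_a_score_score",
    "hpsv2_score",
    "vqa_score",
    "sciscore_score",
)


def build_reward_targets(overrides=None, default=1.0):
    """Return a seven-entry target dict, rejecting unknown names and out-of-range values."""
    overrides = dict(overrides or {})
    unknown = set(overrides) - set(REWARD_KEYS)
    if unknown:
        raise KeyError(f"unknown reward name(s): {sorted(unknown)}")
    targets = {key: float(overrides.get(key, default)) for key in REWARD_KEYS}
    for key, value in targets.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key}={value} is outside the normalised [0, 1] range")
    return targets


targets = build_reward_targets({"aesthetic_score": 0.3})
print(targets["aesthetic_score"], targets["clip_score"])  # 0.3 1.0
```

The resulting dictionary lists all seven rewards explicitly, which makes the requested trade-off easy to log alongside each generated image.
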
## Reported benchmarks

The paper reports the following headline numbers for the **main MIRO** model
(this repository, `nicolas-dufour/miro`):

| Metric | MIRO (360M) | FLUX-dev (12B) |
|---|---|---|
| GenEval (overall) | **75** (with inference-time reward tuning) / 68 (default) | 67 |
| Inference compute | **1×** | ~370× |
| Aesthetic-metric convergence vs. baseline pretraining | **19×** faster | n/a |

Per-variant scores (GenEval, FID, individual reward scores) for the eight
ablations are reported in the paper's ablation tables. Please refer to
[arXiv:2510.25897](https://arxiv.org/abs/2510.25897) for the full breakdown.

## Training compute and data

- **Default hardware**: 2 nodes × 8 H100 GPUs (16× H100, `16-mixed` precision)
- **Optimiser**: LAMB, lr 1e-3 (5k warmup → cosine decay), weight decay 1e-2
- **Batch size**: 1024 globally (64 per GPU on 16× H100), gradient clipping at 2.0
- **Steps**: 500k (≈ 29 epochs over the enriched training set)
- **Wall-clock on 16× H100**: ~52 hours (≈ 2.65 train it/s sustained)
- **8-GPU fallback**: 1 node × 8 H100 with `trainer.accumulate_grad_batches=2`,
  measured at **≈ 1.45 train it/s** → ~96 hours (~4 days) end-to-end.
  Requires `trainer.strategy.static_graph=false` and
  `trainer.strategy.find_unused_parameters=true` to work with the
  self-conditioning skip in the loss; both flags are set automatically by
  `miro/slurm/launch_multicad_synth_8gpu.py`.
- **Data**: [CC12M](https://huggingface.co/datasets/pixparse/cc12m-wds) +
  [LAION Aesthetics v2 4.5](https://huggingface.co/datasets/laion/aesthetics_v2_4.5)
  filtered to `aesthetic_score >= 6.0` (the higher-quality subset), encoded to
  SDXL VAE latents at 256 resolution. Each sample is paired with seven reward
  scores and FLAN-T5-XL embeddings of both the original and a synthetic
  caption, computed by
  [`miro/data/preprocess_data.py`](https://github.com/nicolas-dufour/miro/blob/main/data/preprocess_data.py).

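The recipe above can be sanity-checked with a few lines of arithmetic. The sketch below reproduces the global batch size and wall-clock estimates from the listed throughputs, and shows one plausible reading of the learning-rate schedule. Several details are assumptions not stated in this card: that "5k warmup" means 5,000 optimiser steps with a linear ramp, that the cosine decays to zero at step 500k, and that the 8-GPU fallback keeps 64 samples per GPU. All function names are hypothetical.

```python
import math

TOTAL_STEPS = 500_000
PEAK_LR = 1e-3
WARMUP_STEPS = 5_000  # assumption: "5k warmup" counts optimiser steps


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to 0 (both shapes are assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step / WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


def global_batch(gpus: int, per_gpu: int, accumulation: int = 1) -> int:
    """Samples consumed per optimiser step."""
    return gpus * per_gpu * accumulation


def wall_clock_hours(steps: int, its_per_second: float) -> float:
    """Training time in hours at a sustained iteration rate."""
    return steps / its_per_second / 3600


print(global_batch(16, 64))                           # 1024
print(global_batch(8, 64, accumulation=2))            # 1024 (per-GPU 64 is assumed)
print(round(wall_clock_hours(TOTAL_STEPS, 2.65), 1))  # 52.4
print(round(wall_clock_hours(TOTAL_STEPS, 1.45), 1))  # 95.8
print(lr_at(2_500), lr_at(WARMUP_STEPS))              # 0.0005 0.001
```

Both hardware configurations reach the same global batch of 1024, and the computed durations match the ~52 and ~96 hours quoted in the list.
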
## Limitations and intended use

This checkpoint is a research artifact released to reproduce and build on the
MIRO paper. Known limitations:

- **Resolution**: 256×256 only. Higher-resolution outputs require upscaling.
- **Domain**: trained on web-scraped image–caption pairs (CC12M + LAION
  Aesthetics, 6.0+ subset). The model inherits the biases of those datasets,
  including under-representation of many cultures, languages, and concepts,
  and the presence of stereotypes. Generations may reflect or amplify these
  biases.
- **Reward-model biases**: the seven reward predictors used during training
  encode their own biases (e.g. aesthetic and human-preference models reflect
  the taste of their annotator pools). Conditioning on these rewards inherits
  and can sharpen those biases.
- **Not for safety-critical use**: outputs are not factual and the SciScore
  reward does not guarantee scientific accuracy.
- **No safety filter** is shipped with the model; users deploying it in
  user-facing settings should add their own.

The model is released under the MIT license; the SDXL VAE and FLAN-T5-XL
encoder it depends on at inference time are loaded from
[`stabilityai/sdxl-vae`](https://huggingface.co/stabilityai/sdxl-vae) and
[`google/flan-t5-xl`](https://huggingface.co/google/flan-t5-xl) and are
subject to their respective licenses.

## Citation

```bibtex
@inproceedings{dufour2026miro,
  title     = {{MIRO}: {M}ult{I}-{R}eward c{O}nditioned pretraining improves {T2I} quality and efficiency},
  author    = {Dufour, Nicolas and Degeorge, Lucas and Ghosh, Arijit and Kalogeiton, Vicky and Picard, David},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}
```

## License

MIT; see <https://github.com/nicolas-dufour/miro/blob/main/LICENSE>.