---
license: mit
library_name: miro-t2i
tags:
  - text-to-image
  - diffusion
  - flow-matching
  - miro
  - reward-conditioning
pipeline_tag: text-to-image
---

# MIRO (main)

![Qualitative samples from MIRO](teaser.jpg)

<sub>Qualitative samples from the released MIRO checkpoint — same gallery as the
teaser of the [project page](https://nicolas-dufour.github.io/miro/).</sub>

**Main MIRO checkpoint.** Trained jointly on all seven reward signals (CLIP, aesthetic, ImageReward, PickScore, HPSv2, VQAScore, SciScore) with a 50/50 mix of original and synthetic captions.

This checkpoint accompanies the paper
**MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency**
(Dufour, Degeorge, Ghosh, Kalogeiton, Picard — ICML 2026).

| | |
|---|---|
| **Paper** | <https://arxiv.org/abs/2510.25897> |
| **Project page** | <https://nicolas-dufour.github.io/miro/> |
| **Code** | <https://github.com/nicolas-dufour/miro> |
| **Demo** | [🤗 Space](https://huggingface.co/spaces/nicolas-dufour/miro) (Gradio + ZeroGPU) |
| **Parameters** | 360.4M |
| **Resolution** | 256×256 (SDXL VAE latent space) |
| **Architecture** | RIN flow-matching backbone, FLAN-T5-XL text conditioning |
| **Training data** | [CC12M](https://huggingface.co/datasets/pixparse/cc12m-wds) + [LAION Aesthetics v2 4.5](https://huggingface.co/datasets/laion/aesthetics_v2_4.5) (6.0+ aesthetic subset) |
| **Reward signals** | `clip_score`, `aesthetic_score`, `image_reward_score`, `pick_a_score_score`, `hpsv2_score`, `vqa_score`, `sciscore_score` |
| **Weights** | `model.safetensors`, **fp32** (EMA master weights — ready for finetuning) |

## Install

```bash
pip install miro-t2i
```

`miro-t2i` is the public PyPI package; its import name is `miro`. The first
call to `MiroPipeline.from_pretrained(...)` will additionally fetch
[`google/flan-t5-xl`](https://huggingface.co/google/flan-t5-xl) (text encoder)
and [`stabilityai/sdxl-vae`](https://huggingface.co/stabilityai/sdxl-vae)
(latent decoder) from the Hub.

## Usage

```python
import torch
from miro import MiroPipeline

pipe = MiroPipeline.from_pretrained("nicolas-dufour/miro")
pipe = pipe.to("cuda", torch.bfloat16)

prompt = (
    "Photography closeup portrait of an adorable rusty broken-down steampunk "
    "robot covered in budding vegetation, surrounded by tall grass, misty "
    "futuristic sci-fi forest environment."
)
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.0)[0]
image.save("out.png")
```

### Reward conditioning

MIRO conditions the flow model on a vector of reward targets in addition to the
text prompt. By default every reward is requested at its maximum (`1.0`); you
can override individual axes to bias generation toward a particular trade-off:

```python
image = pipe(
    prompt,                       # the rusty-robot prompt from above
    reward_targets={
        "clip_score": 1.0,        # strict prompt alignment
        "aesthetic_score": 0.3,   # de-prioritise prettiness
        "image_reward_score": 1.0,  # prioritise general human preference
        # any reward not listed defaults to 1.0
    },
    negative_reward_targets={
        # zeros by default; what to push the unconditional branch toward
    },
    guidance_scale=7.0,
)[0]
```
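
One way to picture how `guidance_scale` interacts with the two reward vectors is the usual classifier-free guidance update, where the prediction under `negative_reward_targets` is extrapolated toward the prediction under `reward_targets`. The sketch below is an assumption about the form of that update, not a copy of the `miro` internals: `guided_velocity` is a hypothetical helper, and plain Python lists stand in for latent tensors.

```python
def guided_velocity(v_pos, v_neg, guidance_scale):
    """Combine two velocity predictions with classifier-free-style guidance.

    v_pos: prediction under `reward_targets` (the positive branch)
    v_neg: prediction under `negative_reward_targets` (the negative branch)
    Both are flat lists of floats standing in for latent tensors.
    NOTE: assumed formula, v_neg + scale * (v_pos - v_neg).
    """
    if len(v_pos) != len(v_neg):
        raise ValueError("both branches must have the same shape")
    return [n + guidance_scale * (p - n) for p, n in zip(v_pos, v_neg)]


# Toy numbers only: a scale of 1.0 returns the positive branch unchanged,
# larger scales extrapolate further away from the negative branch.
v_pos = [1.0, 0.5]
v_neg = [0.0, 0.5]
same = guided_velocity(v_pos, v_neg, 1.0)
pushed = guided_velocity(v_pos, v_neg, 7.0)
```

Under this reading, coordinates where the two branches agree are left untouched, so the further apart the positive and negative reward targets are, the more the guidance step has to act on.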

The seven reward dimensions are:

| Reward | Normalised range | What it measures |
|---|---|---|
| `clip_score` | ~[0, 1] | CLIP text–image alignment |
| `aesthetic_score` | ~[0, 1] | LAION aesthetic-quality predictor |
| `image_reward_score` | ~[0, 1] | ImageReward (general preference model) |
| `pick_a_score_score` | ~[0, 1] | PickScore (human preference) |
| `hpsv2_score` | ~[0, 1] | HPSv2 (human preference v2) |
| `vqa_score` | ~[0, 1] | VQAScore (compositional faithfulness) |
| `sciscore_score` | ~[0, 1] | SciScore (scientific-image plausibility) |
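
Because the reward names are easy to mistype, it can help to build the target dictionary through a small validating helper. The sketch below is illustrative only: `REWARD_KEYS` and `make_reward_targets` are hypothetical names that are not part of `miro-t2i`; the seven keys and the `1.0` default are taken from the table and text above.

```python
REWARD_KEYS = (
    "clip_score",
    "aesthetic_score",
    "image_reward_score",
    "pick_a_score_score",
    "hpsv2_score",
    "vqa_score",
    "sciscore_score",
)


def make_reward_targets(default=1.0, **overrides):
    """Return a full seven-key dict, rejecting unknown reward names."""
    unknown = sorted(set(overrides) - set(REWARD_KEYS))
    if unknown:
        raise KeyError(f"unknown reward name(s): {unknown}")
    targets = {key: float(default) for key in REWARD_KEYS}
    targets.update({key: float(value) for key, value in overrides.items()})
    return targets


# Lower one axis and keep the other six at the default.
targets = make_reward_targets(aesthetic_score=0.3)

# Sweep one axis in five steps (0.0, 0.25, ..., 1.0) for a comparison grid.
sweep = [make_reward_targets(aesthetic_score=step / 4) for step in range(5)]
```

Each resulting dictionary can be passed as `reward_targets=` in the pipeline call shown earlier. Values are not clamped, since the normalised ranges are only approximately `[0, 1]`.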

## Reported benchmarks

The paper reports the following headline numbers for the **main MIRO** model
(this repository, `nicolas-dufour/miro`):

| Metric | MIRO (360M) | FLUX-dev (12B) |
|---|---|---|
| GenEval (overall) | **75** (with inference-time reward tuning) / 68 (default) | 67 |
| Inference compute | **1×** | ~370× |
| Aesthetic-metric convergence vs. baseline pretraining | **19×** faster | — |

Per-variant scores (GenEval, FID, individual reward scores) for the eight
ablations are reported in the paper's ablation tables. Please refer to
[arXiv:2510.25897](https://arxiv.org/abs/2510.25897) for the full breakdown.

## Training compute and data

- **Default hardware**: 2 nodes × 8 H100 GPUs (16× H100, `16-mixed` precision)
- **Optimiser**: LAMB, lr 1e-3 (5k warmup → cosine decay), weight decay 1e-2
- **Batch size**: 1024 globally (64 per GPU on 16× H100), gradient-clip 2.0
- **Steps**: 500 k (≈ 29 epochs over the enriched training set)
- **Wall-clock on 16× H100**: ~52 hours (≈ 2.65 train it/s sustained)
- **8-GPU fallback**: 1 node × 8 H100 with `trainer.accumulate_grad_batches=2`,
  measured at **≈ 1.45 train it/s** → ~96 hours (~4 days) end-to-end.
  Requires `trainer.strategy.static_graph=false` and
  `trainer.strategy.find_unused_parameters=true` to play well with the
  self-conditioning skip in the loss; both flags are set automatically by
  `miro/slurm/launch_multicad_synth_8gpu.py`.
- **Data**: [CC12M](https://huggingface.co/datasets/pixparse/cc12m-wds) +
  [LAION Aesthetics v2 4.5](https://huggingface.co/datasets/laion/aesthetics_v2_4.5)
  filtered to `aesthetic_score >= 6.0` (the higher-quality subset), encoded to
  SDXL VAE latents at 256 resolution. Each sample is paired with seven reward
  scores and FLAN-T5-XL embeddings of both the original and a synthetic
  caption, computed by
  [`miro/data/preprocess_data.py`](https://github.com/nicolas-dufour/miro/blob/main/data/preprocess_data.py).
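
The figures in the list above can be cross-checked with a few lines of arithmetic. This is a rough sketch, not the training code: it assumes that "train it/s" counts optimizer steps, that warmup is linear, and that the cosine decay reaches zero at the final step. None of these three points is stated above, and all function names are hypothetical.

```python
import math

STEPS = 500_000
PER_GPU_BATCH = 64
PEAK_LR = 1e-3
WARMUP_STEPS = 5_000


def global_batch(num_gpus, accumulate_grad_batches=1):
    """Samples consumed per optimizer step."""
    return PER_GPU_BATCH * num_gpus * accumulate_grad_batches


def wall_clock_hours(its_per_second, steps=STEPS):
    """Hours needed to run `steps` iterations at a sustained rate."""
    return steps / its_per_second / 3600


def lr_at(step, total_steps=STEPS):
    """Assumed shape: linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


batch_16gpu = global_batch(16)                           # 2 nodes x 8 H100
batch_8gpu = global_batch(8, accumulate_grad_batches=2)  # 8-GPU fallback
hours_16gpu = wall_clock_hours(2.65)
hours_8gpu = wall_clock_hours(1.45)
samples_seen = STEPS * batch_16gpu
```

Both hardware setups reach the same global batch of 1024, and the two throughput figures give roughly 52 and 96 hours, matching the wall-clock estimates quoted above.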

## Limitations and intended use

This checkpoint is a research artifact released to reproduce and build on the
MIRO paper. Known limitations:

- **Resolution**: 256×256 only. Higher-resolution outputs require upscaling.
- **Domain**: trained on web-scraped image–caption pairs (CC12M + LAION
  Aesthetics 6.0). Inherits the biases of those datasets — including
  under-representation of many cultures, languages, and concepts, and the
  presence of stereotypes. Generations may reflect or amplify these biases.
- **Reward-model biases**: the seven reward predictors used during training
  encode their own biases (e.g. aesthetic and human-preference models reflect
  the taste of their annotator pools). Conditioning on these rewards inherits
  and can sharpen those biases.
- **Not for safety-critical use**: outputs are not factual and the SciScore
  reward does not guarantee scientific accuracy.
- **No safety filter** is shipped with the model; users deploying it in
  user-facing settings should add their own.

The model is released under the MIT license; the SDXL VAE and FLAN-T5-XL
encoder it depends on at inference time are loaded from
[`stabilityai/sdxl-vae`](https://huggingface.co/stabilityai/sdxl-vae) and
[`google/flan-t5-xl`](https://huggingface.co/google/flan-t5-xl) and are
subject to their respective licenses.

## Citation

```bibtex
@inproceedings{dufour2026miro,
  title     = {{MIRO}: {M}ult{I}-{R}eward c{O}nditioned pretraining improves {T2I} quality and efficiency},
  author    = {Dufour, Nicolas and Degeorge, Lucas and Ghosh, Arijit and Kalogeiton, Vicky and Picard, David},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}
```

## License

MIT — see <https://github.com/nicolas-dufour/miro/blob/main/LICENSE>.