	---
	license: mit
	library_name: miro-t2i
	tags:
	- text-to-image
	- diffusion
	- flow-matching
	- miro
	- reward-conditioning
	pipeline_tag: text-to-image
	---

	# MIRO (main)

	![Qualitative samples from MIRO](teaser.jpg)

	<sub>Qualitative samples from the released MIRO checkpoint — same gallery as the
	teaser of the [project page](https://nicolas-dufour.github.io/miro/).</sub>

	This is the main MIRO checkpoint, trained jointly on all seven reward signals (CLIP, aesthetic, ImageReward, PickScore, HPSv2, VQAScore, SciScore) with a 50/50 mix of original and synthetic captions.

	This checkpoint accompanies the paper
	*MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency*
	(Dufour, Degeorge, Ghosh, Kalogeiton, Picard; ICML 2026).

	| | |
	|---|---|
	| Paper | <https://arxiv.org/abs/2510.25897> |
	| Project page | <https://nicolas-dufour.github.io/miro/> |
	| Code | <https://github.com/nicolas-dufour/miro> |
	| Demo | [🤗 Space](https://huggingface.co/spaces/nicolas-dufour/miro) (Gradio + ZeroGPU) |
	| Parameters | 360.4M |
	| Resolution | 256×256 (SDXL VAE latent space) |
	| Architecture | RIN flow-matching backbone, FLAN-T5-XL text conditioning |
	| Training data | [CC12M](https://huggingface.co/datasets/pixparse/cc12m-wds) + [LAION Aesthetics v2 4.5](https://huggingface.co/datasets/laion/aesthetics_v2_4.5) (6.0+ aesthetic subset) |
	| Reward signals | `clip_score`, `aesthetic_score`, `image_reward_score`, `pick_a_score_score`, `hpsv2_score`, `vqa_score`, `sciscore_score` |
	| Weights | `model.safetensors`, fp32 (EMA master weights, ready for finetuning) |
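As a rough sanity check on the table above, the sketch below derives the latent tensor shape for a 256×256 image and the approximate size of the fp32 weights. It assumes the usual SDXL VAE geometry (8× spatial downsampling, 4 latent channels), which this card does not state explicitly. The function names are illustrative and are not part of `miro-t2i`.

```python
# Illustrative helpers only; none of these names come from the miro-t2i package.

VAE_DOWNSAMPLE = 8    # assumption: standard SDXL VAE spatial reduction
LATENT_CHANNELS = 4   # assumption: standard SDXL VAE latent channel count


def latent_shape(height: int, width: int) -> tuple[int, int, int]:
    """Return (channels, latent_height, latent_width) for an image size."""
    if height % VAE_DOWNSAMPLE or width % VAE_DOWNSAMPLE:
        raise ValueError("image sides must be multiples of the VAE downsample factor")
    return (LATENT_CHANNELS, height // VAE_DOWNSAMPLE, width // VAE_DOWNSAMPLE)


def checkpoint_size_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Approximate raw weight size in GB (10^9 bytes); fp32 is 4 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


print(latent_shape(256, 256))                 # (4, 32, 32)
print(round(checkpoint_size_gb(360.4e6), 2))  # 1.44
```

Under these assumptions the flow model operates on 4×32×32 latents, and the fp32 `model.safetensors` file should be on the order of 1.4 GB.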

	## Install

	```bash
	pip install miro-t2i
	```

	`miro-t2i` is the public PyPI package; it imports as `import miro`. The first
	call to `MiroPipeline.from_pretrained(...)` will additionally fetch
	[`google/flan-t5-xl`](https://huggingface.co/google/flan-t5-xl) (text encoder)
	and [`stabilityai/sdxl-vae`](https://huggingface.co/stabilityai/sdxl-vae)
	(latent decoder) from the Hub.

	## Usage

	```python
	import torch
	from miro import MiroPipeline

	pipe = MiroPipeline.from_pretrained("nicolas-dufour/miro")
	pipe = pipe.to("cuda", torch.bfloat16)

	prompt = (
	    "Photography closeup portrait of an adorable rusty brokendown steampunk "
	    "robot covered in budding vegetation, surrounded by tall grass, misty "
	    "futuristic scifi forest environment."
	)
	image = pipe(prompt, num_inference_steps=50, guidance_scale=7.0)[0]
	image.save("out.png")
	```

	### Reward conditioning

	MIRO conditions the flow model on a vector of reward targets in addition to the
	text prompt. By default every reward is requested at its maximum (`1.0`); you
	can override individual axes to bias generation toward a particular trade-off:

	```python
	image = pipe(
	    prompt,  # the rusty-robot prompt from above
	    reward_targets={
	        "clip_score": 1.0,          # strict prompt alignment
	        "aesthetic_score": 0.3,     # de-prioritise prettiness
	        "image_reward_score": 1.0,  # prioritise general human preference
	        # any reward not listed defaults to 1.0
	    },
	    negative_reward_targets={
	        # zeros by default; what to push the unconditional branch toward
	    },
	    guidance_scale=7.0,
	)[0]
	```

	The seven reward dimensions are:

	| Reward | Normalised range | What it measures |
	|---|---|---|
	| `clip_score` | ~[0, 1] | CLIP text–image alignment |
	| `aesthetic_score` | ~[0, 1] | LAION aesthetic-quality predictor |
	| `image_reward_score` | ~[0, 1] | ImageReward (general preference model) |
	| `pick_a_score_score` | ~[0, 1] | PickScore (human preference) |
	| `hpsv2_score` | ~[0, 1] | HPSv2 (human preference v2) |
	| `vqa_score` | ~[0, 1] | VQAScore (compositional faithfulness) |
	| `sciscore_score` | ~[0, 1] | SciScore (scientific-image plausibility) |
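The defaulting rule described above (rewards not listed fall back to `1.0`) can be made explicit with a small helper that also catches misspelled keys before they reach the pipeline. This is only a sketch: `build_reward_targets` is a hypothetical helper, not part of `miro-t2i`, and the strict `[0, 1]` check is an assumption based on the approximate ranges in the table.

```python
# Hypothetical convenience helper; not part of the miro-t2i API.

REWARD_NAMES = (
    "clip_score",
    "aesthetic_score",
    "image_reward_score",
    "pick_a_score_score",
    "hpsv2_score",
    "vqa_score",
    "sciscore_score",
)


def build_reward_targets(overrides=None, default=1.0):
    """Return a full seven-key dict, filling unlisted rewards with `default`."""
    overrides = dict(overrides or {})
    unknown = sorted(set(overrides) - set(REWARD_NAMES))
    if unknown:
        raise KeyError(f"unknown reward name(s): {unknown}")
    for name, value in overrides.items():
        # Assumption: targets are expected to stay within the normalised range.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside [0, 1]")
    return {name: overrides.get(name, default) for name in REWARD_NAMES}


targets = build_reward_targets({"aesthetic_score": 0.3})
print(targets["aesthetic_score"], targets["clip_score"])  # 0.3 1.0
```

The resulting dict could then be passed as `reward_targets=targets` in the call shown earlier.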

	## Reported benchmarks

	The paper reports the following headline numbers for the main MIRO model
	(this repo's `nicolas-dufour/miro`):

	| Metric | MIRO (350M) | FLUX-dev (12B) |
	|---|---|---|
	| GenEval (overall) | 75 (with inference-time reward tuning) / 68 (default) | 67 |
	| Inference compute | 1× | ~370× |
	| Aesthetic-metric convergence vs. baseline pretraining | 19× faster | n/a |

	Per-variant scores (GenEval, FID, individual reward scores) for the eight
	ablations are reported in the paper's ablation tables. Please refer to
	[arXiv:2510.25897](https://arxiv.org/abs/2510.25897) for the full breakdown.

	## Training compute and data

	- Default hardware: 2 nodes × 8 H100 GPUs (16× H100, `16-mixed` precision)
	- Optimiser: LAMB, lr 1e-3 (5k warmup → cosine decay), weight decay 1e-2
	- Batch size: 1024 globally (64 per GPU on 16× H100), gradient-clip 2.0
	- Steps: 500k (≈ 29 epochs over the enriched training set)
	- Wall-clock on 16× H100: ~52 hours (≈ 2.65 train it/s sustained)
	- 8-GPU fallback: 1 node × 8 H100 with `trainer.accumulate_grad_batches=2`,
	measured at ≈ 1.45 train it/s → ~96 hours (~4 days) end-to-end.
	Requires `trainer.strategy.static_graph=false` and
	`trainer.strategy.find_unused_parameters=true` to play well with the
	self-conditioning skip in the loss; both flags are set automatically by
	`miro/slurm/launch_multicad_synth_8gpu.py`.
	- Data: [CC12M](https://huggingface.co/datasets/pixparse/cc12m-wds) +
	[LAION Aesthetics v2 4.5](https://huggingface.co/datasets/laion/aesthetics_v2_4.5)
	filtered to `aesthetic_score >= 6.0` (the higher-quality subset), encoded to
	SDXL VAE latents at 256 resolution. Each sample is paired with seven reward
	scores and FLAN-T5-XL embeddings of both the original and a synthetic
	caption, computed by
	[`miro/data/preprocess_data.py`](https://github.com/nicolas-dufour/miro/blob/main/data/preprocess_data.py).
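The figures in the list above are mutually consistent, which the following sketch checks: global batch size is the per-GPU batch × GPU count × gradient-accumulation steps, and wall-clock time is total steps divided by sustained iterations per second. The learning-rate function is a more speculative illustration. The card only says "5k warmup → cosine decay", so the linear warmup shape and the decay to zero at step 500k are assumptions, not the authors' confirmed schedule.

```python
import math

TOTAL_STEPS = 500_000
WARMUP_STEPS = 5_000
PEAK_LR = 1e-3


def global_batch(per_gpu: int, gpus: int, accumulate: int = 1) -> int:
    """Effective batch size per optimiser step."""
    return per_gpu * gpus * accumulate


def wall_clock_hours(steps: int, iters_per_second: float) -> float:
    """Training time in hours at a sustained iteration rate."""
    return steps / iters_per_second / 3600


def lr_at(step: int) -> float:
    """Assumed schedule: linear warmup, then cosine decay to 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))


print(global_batch(64, 16))                            # 1024
# Assumes the per-GPU batch stays at 64 in the 8-GPU fallback.
print(global_batch(64, 8, accumulate=2))               # 1024
print(round(wall_clock_hours(TOTAL_STEPS, 2.65), 1))   # 52.4
print(round(wall_clock_hours(TOTAL_STEPS, 1.45), 1))   # 95.8
```

With `accumulate_grad_batches=2` the 8-GPU run keeps the same effective batch of 1024, so it takes the same number of optimiser steps and only the wall-clock time changes.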

	## Limitations and intended use

	This checkpoint is a research artifact released to reproduce and build on the
	MIRO paper. Known limitations:

	- Resolution: 256×256 only. Higher-resolution outputs require upscaling.
	- Domain: trained on web-scraped image–caption pairs (CC12M + the 6.0+
	aesthetic subset of LAION Aesthetics v2 4.5). The model inherits the biases
	of those datasets, including under-representation of many cultures,
	languages, and concepts, and the presence of stereotypes. Generations may
	reflect or amplify these biases.
	- Reward-model biases: the seven reward predictors used during training
	encode their own biases (e.g. aesthetic and human-preference models reflect
	the taste of their annotator pools). Conditioning on these rewards inherits
	and can sharpen those biases.
	- Not for safety-critical use: outputs are not factual and the SciScore
	reward does not guarantee scientific accuracy.
	- No safety filter is shipped with the model; users deploying it in
	user-facing settings should add their own.

	The model is released under the MIT license; the SDXL VAE and FLAN-T5-XL
	encoder it depends on at inference time are loaded from
	[`stabilityai/sdxl-vae`](https://huggingface.co/stabilityai/sdxl-vae) and
	[`google/flan-t5-xl`](https://huggingface.co/google/flan-t5-xl) and are
	subject to their respective licenses.

	## Citation

	```bibtex
	@inproceedings{dufour2026miro,
	title = {{MIRO}: {M}ult{I}-{R}eward c{O}nditioned pretraining improves {T2I} quality and efficiency},
	author = {Dufour, Nicolas and Degeorge, Lucas and Ghosh, Arijit and Kalogeiton, Vicky and Picard, David},
	booktitle = {International Conference on Machine Learning (ICML)},
	year = {2026}
	}
	```

	## License

	MIT — see <https://github.com/nicolas-dufour/miro/blob/main/LICENSE>.