---
license: mit
library_name: diffusers
pipeline_tag: text-to-image
tags:
  - diffusers
  - minit2i
  - image-generation
  - text-to-image
  - flow-matching
  - pixel-space
inference: true
widget:
  - text: A lonely astronaut standing on a quiet beach under two moons.
    output:
      url: MiniT2I-B-16/demo.png
language:
  - en
---

# BiliSakura/MiniT2I-diffusers

Self-contained MiniT2I text-to-image checkpoints for Hugging Face diffusers. Each variant folder ships its own pipeline code, component modules, bundled FLAN-T5-Large text encoder, and transformer weights.

Converted from [`MiniT2I/MiniT2I`](https://huggingface.co/MiniT2I/MiniT2I) using [MiniT2I-diffusers](https://github.com/Bili-Sakura/Visual-Generative-Foundation-Model-Collection/tree/main/libs/MiniT2I-diffusers) in [Visual-Generative-Foundation-Model-Collection](https://github.com/Bili-Sakura/Visual-Generative-Foundation-Model-Collection).

## Available checkpoints

| Subfolder | Model | Params (denoiser + text encoder) | Patch | Recommended CFG |
| --- | --- | --- | ---: | ---: |
| [`MiniT2I-B-16/`](MiniT2I-B-16/) | MiniT2I-B/16 | 258M + 341M | 16 | 2.5 |
| [`MiniT2I-L-16/`](MiniT2I-L-16/) | MiniT2I-L/16 | 912M + 341M | 16 | 6.0 |
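As a rough sizing aid, the parameter counts in the table can be turned into an estimate of weight memory. This is a back-of-the-envelope sketch: it assumes 2 bytes per parameter (bfloat16) and counts weights only, so activations and framework overhead are not included. The helper names are illustrative and not part of the repository.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_M = {
    "MiniT2I-B-16": {"denoiser": 258, "text_encoder": 341},
    "MiniT2I-L-16": {"denoiser": 912, "text_encoder": 341},
}

BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each parameter in 16 bits


def total_params_m(variant: str) -> int:
    """Total parameters (millions): denoiser plus text encoder."""
    return sum(PARAMS_M[variant].values())


def weight_gib(variant: str, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Approximate weight memory in GiB (weights only, no activations)."""
    return total_params_m(variant) * 1e6 * bytes_per_param / 2**30


if __name__ == "__main__":
    for name in PARAMS_M:
        print(f"{name}: {total_params_m(name)}M params, ~{weight_gib(name):.2f} GiB in bf16")
```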

## Repo layout

```text
BiliSakura/MiniT2I-diffusers/
├── README.md
├── MiniT2I-B-16/
│   ├── pipeline.py
│   ├── model_index.json
│   ├── conversion_metadata.json
│   ├── demo.png
│   ├── scheduler/
│   │   └── scheduler_config.json
│   ├── text_encoder/
│   ├── tokenizer/
│   └── transformer/
│       ├── config.json
│       ├── diffusion_pytorch_model.safetensors
│       └── transformer_minit2i.py
└── MiniT2I-L-16/
    └── ...
```

Each variant is self-contained: load with `custom_pipeline=.../pipeline.py` and `trust_remote_code=True`. MiniT2I denoises directly in RGB pixel space (no VAE).
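Before loading, it can help to confirm that a downloaded variant folder is complete. The sketch below checks for the entries shown in the layout above. The list of required paths is an assumption derived from that tree (`conversion_metadata.json` and `demo.png` are treated as optional), and `missing_entries` is a hypothetical helper, not something shipped with the repository.

```python
from __future__ import annotations

from pathlib import Path

# Paths relative to a variant folder, taken from the repo layout above.
# Assumption: these are the entries needed for loading; metadata and demo image are optional.
REQUIRED_ENTRIES = (
    "pipeline.py",
    "model_index.json",
    "scheduler/scheduler_config.json",
    "text_encoder",
    "tokenizer",
    "transformer/config.json",
    "transformer/diffusion_pytorch_model.safetensors",
    "transformer/transformer_minit2i.py",
)


def missing_entries(variant_dir: str | Path) -> list[str]:
    """Return the required entries that do not exist under variant_dir."""
    root = Path(variant_dir)
    return [rel for rel in REQUIRED_ENTRIES if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("./MiniT2I-B-16")
    print("variant folder looks complete" if not missing else f"missing: {missing}")
```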

## Demo

![MiniT2I-B-16 demo](MiniT2I-B-16/demo.png)

Prompt: *"A lonely astronaut standing on a quiet beach under two moons."* Generated with MiniT2I-B/16 at 512×512, 100 steps, `guidance_scale=2.5`, seed 42.

## Load from Hugging Face

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "BiliSakura/MiniT2I-diffusers/MiniT2I-B-16",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    "A lonely astronaut standing on a quiet beach under two moons.",
    num_inference_steps=100,
    guidance_scale=2.5,
    generator=generator,
).images[0]
image.save("demo.png")
```

For MiniT2I-L/16, replace `MiniT2I-B-16` in the model path with `MiniT2I-L-16` and set `guidance_scale=6.0`.
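If the repo id with a trailing variant folder is not accepted in your environment, one alternative is to download only that variant and load it from the resulting local path. The sketch below assumes `huggingface_hub` is installed and uses its `snapshot_download` with `allow_patterns`. `load_variant` and `variant_patterns` are hypothetical helper names, and the loading arguments simply mirror the ones used elsewhere in this README.

```python
from pathlib import Path

REPO_ID = "BiliSakura/MiniT2I-diffusers"


def variant_patterns(variant: str) -> list[str]:
    """Glob patterns that select only the files under one variant folder."""
    return [f"{variant}/*"]


def load_variant(variant: str = "MiniT2I-B-16", device: str = "cuda"):
    """Download one variant folder from the Hub and load it as a local pipeline."""
    # Heavy dependencies are imported lazily so variant_patterns stays usable without them.
    import torch
    from diffusers import DiffusionPipeline
    from huggingface_hub import snapshot_download

    # Fetch only the chosen variant; returns the local snapshot directory.
    snapshot = snapshot_download(repo_id=REPO_ID, allow_patterns=variant_patterns(variant))
    model_dir = Path(snapshot) / variant

    pipe = DiffusionPipeline.from_pretrained(
        str(model_dir),
        custom_pipeline=str(model_dir / "pipeline.py"),
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )
    return pipe.to(device)
```

Calling `load_variant("MiniT2I-L-16")` would fetch and load the larger variant in the same way.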

## Load from a local clone

```python
from pathlib import Path
import torch
from diffusers import DiffusionPipeline

model_dir = Path("./MiniT2I-B-16").resolve()
pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    "A lonely astronaut standing on a quiet beach under two moons.",
    num_inference_steps=100,
    guidance_scale=2.5,
    generator=generator,
).images[0]
image.save("demo.png")
```

Load a **variant subfolder** (e.g. `./MiniT2I-B-16`), not the repo root.

## Recommended inference settings

| Variant | Resolution | Steps | CFG scale | `torch_dtype` |
| --- | --- | ---: | ---: | --- |
| `MiniT2I-B-16` | 512×512 | 100 | 2.5 | `bfloat16` |
| `MiniT2I-L-16` | 512×512 | 100 | 6.0 | `bfloat16` |

For GenEval / DPG-Bench evaluation, upstream configs use `guidance_scale=5.0` for both B/16 and L/16.
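The settings above can be kept in one place so that the right guidance scale is picked per variant. This is a minimal sketch with a hypothetical helper, `sampling_kwargs`. Keeping 100 steps in benchmark mode is an assumption, since only the benchmark guidance scale is stated above.

```python
# Recommended CFG scales per variant, copied from the table above.
RECOMMENDED_CFG = {"MiniT2I-B-16": 2.5, "MiniT2I-L-16": 6.0}
BENCHMARK_CFG = 5.0  # used by upstream GenEval / DPG-Bench configs for both variants
DEFAULT_STEPS = 100  # assumption: the same step count is kept for benchmark runs


def sampling_kwargs(variant: str, benchmark: bool = False) -> dict:
    """Keyword arguments to pass to the pipeline call for a given variant."""
    if variant not in RECOMMENDED_CFG:
        raise KeyError(f"unknown variant: {variant!r}")
    scale = BENCHMARK_CFG if benchmark else RECOMMENDED_CFG[variant]
    return {"num_inference_steps": DEFAULT_STEPS, "guidance_scale": scale}
```

With a loaded pipeline, this could be used as `pipe(prompt, generator=generator, **sampling_kwargs("MiniT2I-L-16"))`.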

## Interface notes

- Text conditioning uses bundled `google/flan-t5-large` (`T5EncoderModel` + `T5Tokenizer`).
- Scheduler is `FlowMatchEulerDiscreteScheduler` with 1000 training timesteps and `shift=1.0`.
- `guidance_scale > 1.0` enables classifier-free guidance with an empty-string null prompt.
- Output resolution is fixed at 512×512 for these exports.
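To make the guidance and scheduler notes concrete, here is a small pure-Python sketch of the arithmetic involved. It uses the standard classifier-free guidance combination and the usual flow-matching Euler update and sigma shift. The bundled `pipeline.py` is assumed, not confirmed, to follow the same form, and the function names are illustrative.

```python
def shift_sigma(sigma: float, shift: float = 1.0) -> float:
    """Sigma time-shift; with shift == 1.0 the schedule is left unchanged."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def cfg_combine(v_uncond: list[float], v_cond: list[float], guidance_scale: float) -> list[float]:
    """Classifier-free guidance: move from the null-prompt prediction toward the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(v_uncond, v_cond)]


def euler_step(x: list[float], v: list[float], sigma: float, sigma_next: float) -> list[float]:
    """One Euler step of the flow ODE from sigma to sigma_next."""
    dt = sigma_next - sigma
    return [xi + dt * vi for xi, vi in zip(x, v)]


if __name__ == "__main__":
    v = cfg_combine([0.0, 0.5], [1.0, 1.0], guidance_scale=2.5)
    print(euler_step([1.0, 1.0], v, sigma=shift_sigma(1.0), sigma_next=shift_sigma(0.5)))
```

With `guidance_scale=1.0` the combination returns the conditional prediction unchanged, which is consistent with guidance only taking effect above 1.0.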

## Regenerate bundles

From the repository root:

```bash
conda activate rsgen
python scripts/convert_minit2i_to_bilisakura.py
```

## Links

- Blog: [MiniT2I: A Minimalist Baseline for Text-to-Image Generation](https://peppaking8.github.io/#/post/minit2i)
- Upstream checkpoints: [MiniT2I/MiniT2I](https://huggingface.co/MiniT2I/MiniT2I)
- PyTorch/Diffusers source: [MiniT2I-diffusers](https://github.com/Bili-Sakura/Visual-Generative-Foundation-Model-Collection/tree/main/libs/MiniT2I-diffusers)

## Citation

```bibtex
@misc{minit2i2026,
  title  = {MiniT2I: A Minimalist Baseline for Text-to-Image Generation},
  author = {Wang, Xianbang and Zhao, Hanhong and Lu, Yiyang and Zhou, Kangyang and Ma, Linrui and He, Kaiming},
  year   = {2026},
  url    = {https://peppaking8.github.io/#/post/minit2i}
}
```