File size: 9,518 Bytes
22de288
 
e61d1d5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c893ed7
22de288
e61d1d5
 
 
 
 
 
 
 
 
 
b0f6799
 
e61d1d5
 
c893ed7
e61d1d5
c893ed7
e61d1d5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
432325a
e61d1d5
 
432325a
 
e61d1d5
 
 
432325a
e61d1d5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
432325a
e61d1d5
432325a
e61d1d5
432325a
e61d1d5
432325a
e61d1d5
432325a
e61d1d5
432325a
e61d1d5
432325a
e61d1d5
432325a
e61d1d5
432325a
 
 
 
 
 
 
 
 
 
c893ed7
562355d
 
e61d1d5
 
c893ed7
e61d1d5
562355d
e61d1d5
562355d
e61d1d5
562355d
 
 
 
 
e61d1d5
562355d
 
 
 
 
 
 
e61d1d5
562355d
e61d1d5
562355d
 
 
 
 
e61d1d5
c893ed7
562355d
 
e61d1d5
 
 
 
 
 
562355d
 
 
 
 
 
 
 
e61d1d5
 
432325a
e61d1d5
 
 
 
 
432325a
e61d1d5
 
 
 
432325a
e61d1d5
 
 
 
 
 
 
 
c893ed7
 
 
e61d1d5
 
 
 
 
 
 
 
432325a
e61d1d5
 
 
 
 
 
c7e493f
 
 
 
 
 
 
 
e61d1d5
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3-VL-8B-Instruct
tags:
- agent
- image-generation
- tool-use
- visual-reasoning
- self-distillation
- grpo
- reinforcement-learning
- multimodal
- qwen3-vl
datasets:
- MeiGen-AI/GenEvolve-Data-Bench
---

<div align="center">

<img src="assets/logo_genevolve.png" alt="GenEvolve" width="160">

<h1>GenEvolve</h1>

<p><strong><em>Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation</em></strong></p>

<p>
  <a href="https://arxiv.org/abs/2605.21605">
    <img alt="Paper" src="https://img.shields.io/badge/πŸ“„_Paper-arXiv:2605.21605-b31b1b"></a>
  <a href="https://ephemeral182.github.io/GenEvolve/">
    <img alt="Project Page" src="https://img.shields.io/badge/🌐_Project-Page-1f6feb"></a>
  <a href="https://github.com/MeiGen-AI/GenEvolve">
    <img alt="Code" src="https://img.shields.io/badge/πŸ’Ύ_GitHub-Code-181717"></a>
  <a href="https://huggingface.co/datasets/MeiGen-AI/GenEvolve-Data-Bench">
    <img alt="Dataset" src="https://img.shields.io/badge/πŸ€—_Dataset-GenEvolve--Data-FFD21E"></a>
</p>

</div>

This repository hosts the **GenEvolve agent policy** β€” a Qwen3-VL-8B-Instruct backbone fine-tuned and self-evolved into a tool-orchestrated image-generation agent. Given a user request, the agent issues web/image searches, retrieves visual references, activates internal generation knowledge, and emits an executable **prompt-reference program** `z = (gen_prompt, reference_images)` that drives any reference-conditioned downstream generator (Qwen-Image-Edit, Nano Banana Pro, ...).

<div align="center">
<img src="assets/teaser.jpg" alt="GenEvolve teaser" width="100%">

<p><em>The same trained agent policy paired with two reference-conditioned generators ⟢<br>
<strong>Qwen-Image-Edit (open)</strong> &nbsp;Β·&nbsp; <strong>Nano Banana Pro (strong)</strong></em></p>
</div>

---

## ✨ Highlights

- **Tool-orchestrated trajectories.** The agent calls `search`, `image_search`, and `query_knowledge` (8 callable generation skills) before producing a final program `z = (gen_prompt, reference_images)`.
- **Self-evolution with Visual Experience Distillation.** Best-vs-worst trajectory pairs are distilled token-level into the deployed student. **No runtime memory at inference.**
- **Generator-transferable.** The same trained policy works with both an open-source generator (Qwen-Image-Edit-2511) and a strong proprietary generator (Nano Banana Pro).

## πŸ“Š Headline Results

### GenEvolve-Bench (KScore, held-out split)

| Method | Generator | KScore | Knowledge-Anch. | Quality-Anch. |
|---|---|---:|---:|---:|
| Qwen-Image (raw) | Qwen-Image | 0.2987 | 0.2384 | 0.3768 |
| Nano Banana Pro (raw) | Nano Banana Pro | 0.5298 | 0.5160 | 0.5477 |
| Gen-Searcher 8B | Qwen-Image-Edit-2511 | 0.3493 | 0.3293 | 0.3745 |
| Gen-Searcher 8B | Nano Banana Pro | 0.5481 | 0.5472 | 0.5492 |
| **GenEvolve (Ours)** | Qwen-Image-Edit-2511 | **0.3663** | **0.3410** | **0.3990** |
| **GenEvolve (Ours)** | Nano Banana Pro | **0.5739** | **0.5669** | **0.5830** |

### WISE Benchmark (WiScore, six knowledge categories)

| Model | Cultural | Time | Space | Biology | Physics | Chemistry | **Overall** |
|---|---:|---:|---:|---:|---:|---:|---:|
| GPT-4o | 0.81 | 0.71 | **0.89** | **0.83** | 0.79 | 0.74 | 0.80 |
| Gen-Searcher-8B + Qwen-Image | 0.80 | 0.71 | 0.82 | 0.76 | 0.74 | 0.75 | 0.77 |
| Mind-Brush | 0.83 | 0.69 | 0.84 | 0.71 | **0.85** | 0.68 | 0.78 |
| **GenEvolve + Qwen-Image-Edit** | **0.84** | 0.74 | 0.87 | **0.83** | 0.81 | **0.83** | **0.82** |

---

## 🧠 Method Overview

<p align="center"><img src="assets/overview.png" alt="GenEvolve method overview" width="92%"></p>

For a user request, the agent samples a multi-turn trajectory of tool calls before emitting the final prompt-reference program. The downstream generator then renders the image.

---

## πŸ–ΌοΈ Visual Demos

<p align="center"><img src="assets/visual_comparison.png" alt="Qualitative comparison" width="100%"></p>

<p align="center"><sub>Qualitative comparison on representative cases. <span style="color:#D97706">Orange</span> marks external/uncommon knowledge requirements; <span style="color:#2563EB">blue</span> marks internal generation-knowledge requirements.</sub></p>

### 🎨 Gallery β€” paired with Nano Banana Pro

<p align="center"><img src="assets/gallery_nano.jpg" alt="GenEvolve + Nano Banana Pro gallery" width="100%"></p>

<p align="center"><sub>The same agent policy with Nano Banana Pro as the downstream renderer. Examples cover spatial layout, text rendering, quantity counting, attribute binding, anatomy/pose, creative transfer, material physics, and aesthetic drawing.</sub></p>

### 🎨 Gallery β€” paired with Qwen-Image-Edit (open)

<p align="center"><img src="assets/gallery_qwen.jpg" alt="GenEvolve + Qwen-Image-Edit gallery" width="100%"></p>

<p align="center"><sub>Same trained policy paired with the open-source Qwen-Image-Edit-2511 renderer; consistent quality across both generators reflects generator-transferable orchestration.</sub></p>

---

## πŸš€ Quick Start

The deployed checkpoint is the **student policy** β€” it consumes a user prompt and returns a JSON `gen_prompt + reference_images` program through a `<think>/<tool_call>/<answer>` loop. The end-to-end runtime (vLLM serving + agent loop + tools + Qwen/Nano renderers) lives in the [GitHub repo](https://github.com/MeiGen-AI/GenEvolve); the snippet below mirrors its installation and usage.

### 1. Install the main GenEvolve runtime

```bash
git clone https://github.com/MeiGen-AI/GenEvolve.git
cd GenEvolve

conda create -n genevolve python=3.11 -y && conda activate genevolve
pip install -U pip setuptools wheel packaging psutil ninja
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install --no-build-isolation -r requirements.txt
pip install -e .
```

Qwen-Image-Edit rendering runs as a **separate FastAPI service** (kept out of the vLLM environment to avoid CUDA/diffusers conflicts). Set up that service from the GitHub README when you want to use `--backend qwen-image-edit-service`.

### 2. Serve the agent policy

```bash
# Single GPU / single replica.
MODEL_PATH=MeiGen-AI/GenEvolve PORT=8000 TP=1 DP=1 bash scripts/serve_vllm.sh

# Higher throughput on one 8-GPU node (8 replicas, 1 GPU each).
MODEL_PATH=MeiGen-AI/GenEvolve PORT=8000 TP=1 DP=8 bash scripts/serve_vllm.sh
```

`TP` shards one model replica across multiple GPUs; `DP` launches multiple replicas; total GPU usage is `TP Γ— DP`.

### 3. End-to-end example

```bash
export SERPER_API_KEY=<your_key>      # required for search / image_search
export GOOGLE_API_KEY=<your_key>      # or GEMINI_API_KEY; only for --backend nano-banana-pro

# Nano Banana Pro renderer
python examples/quickstart.py \
    --backend nano-banana-pro \
    --base-url http://localhost:8000/v1 \
    --model GenEvolve \
    --prompt "A 1990s travel-magazine cover of two backpackers in front of the Eiffel Tower at golden hour, the title \"PARIS\" in bold serif." \
    --output paris.png

# Qwen-Image-Edit renderer (point at your Qwen-Image-Edit FastAPI service)
python examples/quickstart.py \
    --backend qwen-image-edit-service \
    --service-url http://your-qwen-service:8001 \
    --base-url http://localhost:8000/v1 \
    --model GenEvolve \
    --output paris_qwen.png
```

The agent's final `<answer>` is a JSON object:

```json
{
  "gen_prompt": "...natural-language prompt that refers to images by 'the first reference image', ...",
  "reference_images": [
    {"img_id": "IMG_001", "note": "what to copy from this image"}
  ]
}
```

`gen_prompt` MUST refer to selected images using ordinal phrases (`"the first reference image"`) β€” never raw `IMG_###` ids or URLs. Pass `(gen_prompt, [r["local_path"] for r in reference_images])` to your favourite reference-conditioned generator (Qwen-Image-Edit, Nano Banana Pro, ...) to obtain the final image.

---

## πŸ—‚οΈ Related Artifacts

| Artifact | Link |
|---|---|
| Project page | https://ephemeral182.github.io/GenEvolve/ |
| Paper | Coming soon |
| Code | https://github.com/MeiGen-AI/GenEvolve |
| Training data + benchmark | [MeiGen-AI/GenEvolve-Data-Bench](https://huggingface.co/datasets/MeiGen-AI/GenEvolve-Data-Bench) |
| Base model | [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) |

---

## βš–οΈ Intended Use, Limits, Bias

- **Intended use.** Research on tool-using image-generation agents, agentic prompt-program synthesis, and self-distillation from generated outcomes.
- **Search dependency.** The agent issues live web/image queries through user-provided tool wrappers. Quality of grounded facts depends on the search backend you plug in.
- **Bias.** Tool outputs and reference images come from public web search, which carries demographic, cultural, and geographic biases that may be reflected in agent outputs.

---

## πŸ“‘ Citation

```bibtex
@misc{chen2026genevolveselfevolvingimagegeneration,
      title={GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation}, 
      author={Sixiang Chen and Zhaohu Xing and Tian Ye and Xinyu Geng and Yunlong Lin and Jianyu Lai and Xuanhua He and Fuxiang Zhai and Jialin Gao and Lei Zhu},
      year={2026},
      eprint={2605.21605},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.21605}, 
}
```