# PromptEcho LoRA for Qwen-Image

```python
import torch
from diffusers import DiffusionPipeline

# Switch the device to "mps" for Apple devices
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-2512", torch_dtype=torch.bfloat16, device_map="cuda"
)
pipe.load_lora_weights("robotxx/prompt-echo-qwenimage")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]
```
LoRA adapter for Qwen-Image-2512, fine-tuned with PromptEcho — an annotation-free reinforcement learning method that uses a frozen vision-language model (VLM) to provide reward signals. Instead of relying on human preference labels, PromptEcho computes the negative cross-entropy loss of re-generating the text prompt conditioned on the generated image via a frozen VLM. This log-probability score serves as a deterministic, single-forward-pass reward that is optimized through GRPO/AWM-style policy updates. The resulting model produces images that more faithfully follow complex, detail-rich text prompts.
## Model Details
- Model type: LoRA adapter for a DiT-based text-to-image diffusion model
- Base model: Qwen/Qwen-Image-2512
- LoRA rank (r): 64
- LoRA alpha: 128
- LoRA dropout: 0.0
- Target modules: `attn.to_q`, `attn.to_k`, `attn.to_v`, `attn.to_out.0`, `attn.add_q_proj`, `attn.add_k_proj`, `attn.add_v_proj`, `attn.to_add_out`, `txt_mlp.net.0.proj`, `txt_mlp.net.2`, `img_mlp.net.0.proj`, `img_mlp.net.2`
- License: Apache 2.0
- Paper: PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning
- Repository: https://huggingface.co/robotxx/prompt-echo-qwenimage
## How It Works
- The image generation model produces multiple images from a sampled text caption.
- Each image is fed into a frozen VLM alongside a fixed query (e.g., "Describe this image in detail."). The VLM computes the log-probability of the caption tokens, yielding the reward `R = -CrossEntropyLoss(caption | image, query)`.
- A group advantage is computed and used to update the generation model via AWM/GRPO policy optimization.
No human annotations are required at any stage.
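The reward step above can be sketched in a few lines. To keep the snippet self-contained it works directly on logits rather than loading a real VLM; in practice the `(T, V)` logits would come from a single forward pass of the frozen VLM conditioned on the generated image and the fixed query:

```python
import torch
import torch.nn.functional as F

def prompt_echo_reward(caption_logits: torch.Tensor, caption_ids: torch.Tensor) -> torch.Tensor:
    """PromptEcho reward: negative mean cross-entropy of the caption tokens.

    caption_logits: (T, V) logits the frozen VLM assigns at each caption position.
    caption_ids:    (T,) ground-truth caption token ids.
    """
    ce = F.cross_entropy(caption_logits, caption_ids)  # mean over caption tokens
    return -ce  # higher reward = VLM can "echo" the prompt more easily

# Toy check: logits concentrated on the true tokens score higher than uniform logits.
vocab = 10
ids = torch.tensor([1, 4, 7])
good = torch.full((3, vocab), -5.0)
good[torch.arange(3), ids] = 5.0   # near one-hot on the caption tokens
bad = torch.zeros(3, vocab)        # uniform distribution over the vocabulary
```

Because the reward is a deterministic function of one forward pass, it needs no sampling from the VLM and no human labels, which is the point of the method.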
## Usage
### Requirements
```shell
pip install torch diffusers peft accelerate transformers pillow tqdm
```
### Quick Start
```python
import torch
from diffusers import DiffusionPipeline
from peft import PeftModel

# Load base pipeline
pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-2512",
    torch_dtype=torch.bfloat16,
)

# Load and merge LoRA
transformer = PeftModel.from_pretrained(
    pipeline.transformer,
    "robotxx/prompt-echo-qwenimage",
)
transformer = transformer.merge_and_unload()
pipeline.transformer = transformer

# Move to GPU
device = "cuda"
pipeline.vae.to(device, dtype=torch.float32)
pipeline.text_encoder.to(device, dtype=torch.bfloat16)
pipeline.transformer.to(device, dtype=torch.bfloat16)

# Generate
image = pipeline(
    prompt="A golden retriever sitting on a red velvet couch in a sunlit Victorian living room",
    num_inference_steps=30,
    true_cfg_scale=4.0,
    height=1024,
    width=1024,
).images[0]
image.save("output.png")
```
### Inference Script
For batch generation with the included inference script:
```shell
# Single GPU
python inference/infer_qwenimage.py \
  --base_model_path Qwen/Qwen-Image-2512 \
  --lora_path ./qwenimage-prompt_echo_lora \
  --caption_jsonl ./metadata.jsonl \
  --output_dir ./output_qwenimage

# Multi-GPU (8x)
accelerate launch --num_processes 8 inference/infer_qwenimage.py \
  --base_model_path Qwen/Qwen-Image-2512 \
  --lora_path ./qwenimage-prompt_echo_lora \
  --caption_jsonl ./metadata.jsonl \
  --output_dir ./output_qwenimage
```
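The `--caption_jsonl` file holds one JSON object per line. The exact field names the script expects are not documented here, so the `caption` key below is an assumption; a file in that layout can be produced like this:

```python
import json

# Hypothetical metadata.jsonl layout: one prompt per line under a "caption" key.
records = [
    {"caption": "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"},
    {"caption": "A golden retriever sitting on a red velvet couch in a sunlit Victorian living room"},
]

with open("metadata.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```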
### Inference Parameters
| Parameter | Default | Description |
|---|---|---|
| `num_inference_steps` | 30 | Number of denoising steps |
| `true_cfg_scale` | 4.0 | Norm-guided classifier-free guidance scale |
| `resolution` | 1024 | Output image resolution (height = width) |
| `max_sequence_length` | 512 | Maximum prompt token length |
| `batch_size` | 1 | Per-GPU batch size |
## Training Details
### Training Procedure
The adapter was trained using PromptEcho, an annotation-free RL framework:
- RL algorithm: AWM/GRPO-style policy optimization
- Reward signal: Negative cross-entropy loss from a frozen VLM conditioned on the generated image
- Training regime: bf16 mixed precision
- LoRA initialization: Gaussian
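The group-relative advantage used in GRPO-style updates can be sketched as follows. This is a simplified illustration, not the training code: the rewards of the G images generated from one caption are normalized within that group, so no learned value function (critic) is needed:

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantage for the G samples generated from one caption.

    Each image's PromptEcho reward is compared against its group's mean and
    scaled by the group's standard deviation.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: rewards (negative cross-entropies) for G=4 images from one caption.
rewards = torch.tensor([-1.2, -0.8, -2.0, -1.0])
adv = group_advantages(rewards)  # highest-reward image gets the largest advantage
```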
### Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| LoRA rank (r) | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.0 |
| Bias | none |
## Evaluation
**Important:** No benchmark-specific training was performed. The training data consists of ~200-word dense captions generated by Qwen3-VL-32B, which differ significantly in distribution from all evaluation benchmarks below. All improvements reflect genuine generalization of prompt-following ability.
### DenseAlignBench (Ours)
Pairwise evaluation using Gemini-3-flash-preview as VLM judge, with random image-order shuffling to mitigate position bias.
| Comparison | Win Rate | Baseline Win Rate | Tie Rate | Net Advantage |
|---|---|---|---|---|
| QwenImage + PromptEcho vs Baseline | 53.3% | 37.0% | 9.7% | +16.2pp |
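The position-bias mitigation can be sketched as follows: each pair is shown to the judge in a random order, and the verdict is mapped back to the original labels before tallying. The `judge` callable here is a hypothetical stand-in for the VLM judge, returning `"A"`, `"B"`, or `"tie"` for the first and second item it is shown:

```python
import random

def judged_pairwise(pairs, judge, rng=None):
    """Tally win/tie rates with random image-order shuffling.

    pairs: list of (ours, baseline) items.
    judge: judge(first, second) -> "A" | "B" | "tie".
    Returns (ours_win_rate, baseline_win_rate, tie_rate).
    """
    rng = rng or random.Random(0)
    tally = {"ours": 0, "baseline": 0, "tie": 0}
    for ours, baseline in pairs:
        swapped = rng.random() < 0.5          # randomize presentation order
        first, second = (baseline, ours) if swapped else (ours, baseline)
        verdict = judge(first, second)
        if verdict == "tie":
            tally["tie"] += 1
        elif (verdict == "A") != swapped:     # map verdict back to original labels
            tally["ours"] += 1
        else:
            tally["baseline"] += 1
    n = len(pairs)
    return tally["ours"] / n, tally["baseline"] / n, tally["tie"] / n
```

Because a content-based judge is invariant to presentation order, the shuffling changes nothing for an unbiased judge but averages out any left/right preference of a biased one.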
### GenEval
Structured evaluation covering single/two objects, counting, colors, position, and attribute binding.
| Model | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Bind. | Overall |
|---|---|---|---|---|---|---|---|
| QwenImage-2512 (Baseline) | 0.99 | 0.94 | 0.56 | 0.87 | 0.47 | 0.64 | 0.74 |
| + PromptEcho | 0.99 | 0.93 | 0.68 | 0.90 | 0.55 | 0.70 | 0.79 (+5pp) |
Key improvements: Counting +12pp, Position +8pp, Attribute Binding +6pp.
### DPG-Bench
Dense prompt semantic alignment evaluation across global, entity, attribute, relation, and other categories (scored by mPLUG-large VQA).
| Model | Global | Entity | Attribute | Relation | Other | Overall |
|---|---|---|---|---|---|---|
| QwenImage-2512 (Baseline) | 94.40 | 93.27 | 90.01 | 92.82 | 91.34 | 87.32 |
| + PromptEcho | 91.21 | 93.25 | 90.39 | 93.63 | 93.13 | 87.49 |
### TIIFBench
Fine-grained instruction following benchmark covering 40 dimensions across Basic, Advanced, and Real World categories (scored by GPT-4o). Each cell shows short/long description scores.
| Model | Overall (short) | Overall (long) | Basic Avg | Advanced Avg | Real World |
|---|---|---|---|---|---|
| QwenImage-2512 (Baseline) | 84.89 | 83.25 | 85.5/84.8 | 80.3/82.2 | 92.9/93.7 |
| + PromptEcho | 85.50 | 86.46 | 87.1/87.2 | 80.6/85.4 | 96.2/95.0 |
Key improvements: Reasoning +6.1pp (short), Text rendering 95.5→99.1 (short), Real World +3.3pp (short), Overall +3.2pp (long).
## Citation
```bibtex
@article{promptecho2025,
  title={PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning},
  author={PromptEcho Team},
  journal={arXiv preprint arXiv:2604.12652},
  year={2025}
}
```
## Framework Versions
- PEFT: 0.18.0
- Diffusers: >= 0.30.0
- PyTorch: >= 2.0