Use with the Diffusers library

pip install -U diffusers transformers accelerate

import torch
from diffusers import DiffusionPipeline

# Switch "cuda" to "mps" on Apple devices
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-2512", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("robotxx/prompt-echo-qwenimage")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]
image.save("output.png")

PromptEcho LoRA for Qwen-Image

LoRA adapter for Qwen-Image-2512, fine-tuned with PromptEcho — an annotation-free reinforcement learning method that uses a frozen vision-language model (VLM) to provide reward signals. Instead of relying on human preference labels, PromptEcho computes the negative cross-entropy loss of re-generating the text prompt conditioned on the generated image via a frozen VLM. This log-probability score serves as a deterministic, single-forward-pass reward that is optimized through GRPO/AWM-style policy updates. The resulting model produces images that more faithfully follow complex, detail-rich text prompts.

Model Details

How It Works

  1. The image generation model produces multiple images from a sampled text caption.
  2. Each image is fed into a frozen VLM alongside a fixed query (e.g., "Describe this image in detail."). The VLM computes the log-probability of the caption tokens, yielding the reward: R = -CrossEntropyLoss(caption | image, query).
  3. A group advantage is computed and used to update the generation model via AWM/GRPO policy optimization.

No human annotations are required at any stage.
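The reward and group-advantage steps above can be sketched in plain PyTorch. This is a minimal illustration, not the released training code: the VLM forward pass is elided, and `logits` stands in for the frozen VLM's next-token logits over the caption positions (conditioned on the image and the fixed query).

```python
import torch
import torch.nn.functional as F

def prompt_echo_reward(logits: torch.Tensor, caption_ids: torch.Tensor) -> float:
    """Reward R = -CrossEntropyLoss(caption | image, query).

    logits:      (seq_len, vocab) frozen-VLM logits at each caption position,
                 conditioned on the generated image and the fixed query.
    caption_ids: (seq_len,) token ids of the original caption.
    """
    return -F.cross_entropy(logits, caption_ids).item()

def group_advantage(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: standardize rewards within the group of images
    generated from the same caption."""
    r = torch.tensor(rewards)
    return ((r - r.mean()) / (r.std() + 1e-6)).tolist()
```

An image whose caption the VLM can "echo" with high confidence receives a reward close to zero; a poorly aligned image receives a more negative one, and the within-group standardization turns that spread into advantages for the policy update.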

Usage

Requirements

pip install torch diffusers peft accelerate transformers pillow tqdm

Quick Start

import torch
from diffusers import DiffusionPipeline
from peft import PeftModel

# Load base pipeline
pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-2512",
    torch_dtype=torch.bfloat16,
)

# Load and merge LoRA
transformer = PeftModel.from_pretrained(
    pipeline.transformer,
    "robotxx/prompt-echo-qwenimage",
)
transformer = transformer.merge_and_unload()
pipeline.transformer = transformer

# Move to GPU
device = "cuda"
pipeline.vae.to(device, dtype=torch.float32)
pipeline.text_encoder.to(device, dtype=torch.bfloat16)
pipeline.transformer.to(device, dtype=torch.bfloat16)

# Generate
image = pipeline(
    prompt="A golden retriever sitting on a red velvet couch in a sunlit Victorian living room",
    num_inference_steps=30,
    true_cfg_scale=4.0,
    height=1024,
    width=1024,
).images[0]
image.save("output.png")

Inference Script

For batch generation with the included inference script:

# Single GPU
python inference/infer_qwenimage.py \
    --base_model_path Qwen/Qwen-Image-2512 \
    --lora_path ./qwenimage-prompt_echo_lora \
    --caption_jsonl ./metadata.jsonl \
    --output_dir ./output_qwenimage

# Multi-GPU (8x)
accelerate launch --num_processes 8 inference/infer_qwenimage.py \
    --base_model_path Qwen/Qwen-Image-2512 \
    --lora_path ./qwenimage-prompt_echo_lora \
    --caption_jsonl ./metadata.jsonl \
    --output_dir ./output_qwenimage

Inference Parameters

| Parameter | Default | Description |
|---|---|---|
| num_inference_steps | 30 | Number of denoising steps |
| true_cfg_scale | 4.0 | Norm-guided classifier-free guidance scale |
| resolution | 1024 | Output image resolution (height = width) |
| max_sequence_length | 512 | Maximum prompt token length |
| batch_size | 1 | Per-GPU batch size |

Training Details

Training Procedure

The adapter was trained using PromptEcho, an annotation-free RL framework:

  • RL algorithm: AWM/GRPO-style policy optimization
  • Reward signal: Negative cross-entropy loss from a frozen VLM conditioned on the generated image
  • Training regime: bf16 mixed precision
  • LoRA initialization: Gaussian
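The card does not ship the training code. As a rough illustration of what an AWM/GRPO-style policy update objective looks like, here is a clipped advantage-weighted surrogate loss on toy tensors; the function name, the clipping constant, and the exact surrogate form are assumptions, not the authors' implementation.

```python
import torch

def awm_policy_loss(
    log_probs: torch.Tensor,      # (G,) current-policy log-probs per sample
    old_log_probs: torch.Tensor,  # (G,) behavior-policy log-probs per sample
    advantages: torch.Tensor,     # (G,) group-standardized rewards
    clip_eps: float = 0.2,
) -> torch.Tensor:
    """Clipped surrogate loss in the PPO/GRPO family: weight each sample's
    importance ratio by its advantage, clip the ratio, take the pessimistic
    minimum, and negate so that gradient descent increases expected reward."""
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

With group-standardized advantages (mean zero within each caption's group), samples whose reward beats the group mean are pushed up and the rest are pushed down, without any absolute reward scale or human labels.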

Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| LoRA rank (r) | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.0 |
| Bias | none |

Evaluation

Important: No benchmark-specific training was performed. The training data consists of ~200-word dense captions generated by Qwen3-VL-32B, which differ significantly in distribution from all evaluation benchmarks below. All improvements reflect genuine generalization of prompt-following ability.

DenseAlignBench (Ours)

Pairwise evaluation using Gemini-3-flash-preview as the VLM judge, with random image-order shuffling to mitigate position bias.

| Comparison | Win Rate | Baseline Win Rate | Tie Rate | Net Advantage |
|---|---|---|---|---|
| QwenImage + PromptEcho vs. Baseline | 53.3% | 37.0% | 9.7% | +16.2pp |

GenEval

Structured evaluation covering single/two objects, counting, colors, position, and attribute binding.

| Model | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Bind. | Overall |
|---|---|---|---|---|---|---|---|
| QwenImage-2512 (Baseline) | 0.99 | 0.94 | 0.56 | 0.87 | 0.47 | 0.64 | 0.74 |
| + PromptEcho | 0.99 | 0.93 | 0.68 | 0.90 | 0.55 | 0.70 | 0.79 (+5pp) |

Key improvements: Counting +12pp, Position +8pp, Attribute Binding +6pp.

DPG-Bench

Dense prompt semantic alignment evaluation across global, entity, attribute, relation, and other categories (scored by mPLUG-large VQA).

| Model | Global | Entity | Attribute | Relation | Other | Overall |
|---|---|---|---|---|---|---|
| QwenImage-2512 (Baseline) | 94.40 | 93.27 | 90.01 | 92.82 | 91.34 | 87.32 |
| + PromptEcho | 91.21 | 93.25 | 90.39 | 93.63 | 93.13 | 87.49 |

TIIFBench

Fine-grained instruction-following benchmark covering 40 dimensions across Basic, Advanced, and Real World categories (scored by GPT-4o). Each cell shows short/long description scores.

| Model | Overall (short) | Overall (long) | Basic Avg | Advanced Avg | Real World |
|---|---|---|---|---|---|
| QwenImage-2512 (Baseline) | 84.89 | 83.25 | 85.5/84.8 | 80.3/82.2 | 92.9/93.7 |
| + PromptEcho | 85.50 | 86.46 | 87.1/87.2 | 80.6/85.4 | 96.2/95.0 |

Key improvements: Reasoning +6.1pp (short), Text rendering 95.5→99.1 (short), Real World +3.3pp (short), Overall +3.2pp (long).

Citation

@article{promptecho2025,
  title={PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning},
  author={PromptEcho Team},
  journal={arXiv preprint arXiv:2604.12652},
  year={2025}
}

Framework Versions

  • PEFT: 0.18.0
  • Diffusers: >= 0.30.0
  • PyTorch: >= 2.0