# PromptEcho LoRA for Qwen-Image

```python
import torch
from diffusers import DiffusionPipeline

# Switch the device to "mps" for Apple devices
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-2512", torch_dtype=torch.bfloat16, device_map="cuda"
)
pipe.load_lora_weights("robotxx/prompt-echo-qwenimage")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]
```
LoRA adapter for Qwen-Image-2512, fine-tuned with PromptEcho — an annotation-free reinforcement learning method that uses a frozen vision-language model (VLM) to provide reward signals. Instead of relying on human preference labels, PromptEcho computes the negative cross-entropy loss of re-generating the text prompt conditioned on the generated image via a frozen VLM. This log-probability score serves as a deterministic, single-forward-pass reward that is optimized through GRPO/AWM-style policy updates. The resulting model produces images that more faithfully follow complex, detail-rich text prompts.
## Model Details
- Model type: LoRA adapter for a DiT-based text-to-image diffusion model
- Base model: Qwen/Qwen-Image-2512
- LoRA rank (r): 64
- LoRA alpha: 128
- LoRA dropout: 0.0
- Target modules: `attn.to_q`, `attn.to_k`, `attn.to_v`, `attn.to_out.0`, `attn.add_q_proj`, `attn.add_k_proj`, `attn.add_v_proj`, `attn.to_add_out`, `txt_mlp.net.0.proj`, `txt_mlp.net.2`, `img_mlp.net.0.proj`, `img_mlp.net.2`
- License: Apache 2.0
- Paper: PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning
- Repository: https://huggingface.co/robotxx/prompt-echo-qwenimage
## How It Works
- The image generation model produces multiple images from a sampled text caption.
- Each image is fed into a frozen VLM alongside a fixed query (e.g., "Describe this image in detail."). The VLM computes the log-probability of the caption tokens, yielding the reward `R = -CrossEntropyLoss(caption | image, query)`.
- A group advantage is computed and used to update the generation model via AWM/GRPO policy optimization.
No human annotations are required at any stage.
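The reward step above can be sketched in a few lines. To keep the snippet self-contained it works directly on logits rather than loading a real VLM; in practice the `(T, V)` logits would come from a single forward pass of the frozen VLM conditioned on the generated image and the fixed query:

```python
import torch
import torch.nn.functional as F

def prompt_echo_reward(caption_logits: torch.Tensor, caption_ids: torch.Tensor) -> torch.Tensor:
    """PromptEcho reward: negative mean cross-entropy of the caption tokens.

    caption_logits: (T, V) logits the frozen VLM assigns at each caption position.
    caption_ids:    (T,) ground-truth caption token ids.
    """
    ce = F.cross_entropy(caption_logits, caption_ids)  # mean over caption tokens
    return -ce  # higher reward = VLM can "echo" the prompt more easily

# Toy check: logits concentrated on the true tokens score higher than uniform logits.
vocab = 10
ids = torch.tensor([1, 4, 7])
good = torch.full((3, vocab), -5.0)
good[torch.arange(3), ids] = 5.0   # near one-hot on the caption tokens
bad = torch.zeros(3, vocab)        # uniform distribution over the vocabulary
```

Because the reward is a deterministic function of one forward pass, it needs no sampling from the VLM and no human labels, which is the point of the method.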
## Usage
### Requirements
```shell
pip install torch diffusers peft accelerate transformers pillow tqdm
```
### Quick Start
```python
import torch
from diffusers import DiffusionPipeline
from peft import PeftModel

# Load base pipeline
pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-2512",
    torch_dtype=torch.bfloat16,
)

# Load and merge LoRA
transformer = PeftModel.from_pretrained(
    pipeline.transformer,
    "robotxx/prompt-echo-qwenimage",
)
transformer = transformer.merge_and_unload()
pipeline.transformer = transformer

# Move to GPU
device = "cuda"
pipeline.vae.to(device, dtype=torch.float32)
pipeline.text_encoder.to(device, dtype=torch.bfloat16)
pipeline.transformer.to(device, dtype=torch.bfloat16)

# Generate
image = pipeline(
    prompt="A golden retriever sitting on a red velvet couch in a sunlit Victorian living room",
    num_inference_steps=30,
    true_cfg_scale=4.0,
    height=1024,
    width=1024,
).images[0]
image.save("output.png")
```
### Inference Script
For batch generation with the included inference script:
```shell
# Single GPU
python inference/infer_qwenimage.py \
  --base_model_path Qwen/Qwen-Image-2512 \
  --lora_path ./qwenimage-prompt_echo_lora \
  --caption_jsonl ./metadata.jsonl \
  --output_dir ./output_qwenimage

# Multi-GPU (8x)
accelerate launch --num_processes 8 inference/infer_qwenimage.py \
  --base_model_path Qwen/Qwen-Image-2512 \
  --lora_path ./qwenimage-prompt_echo_lora \
  --caption_jsonl ./metadata.jsonl \
  --output_dir ./output_qwenimage
```
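The `--caption_jsonl` file holds one JSON object per line. The exact field names the script expects are not documented here, so the `caption` key below is an assumption; a file in that layout can be produced like this:

```python
import json

# Hypothetical metadata.jsonl layout: one prompt per line under a "caption" key.
records = [
    {"caption": "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"},
    {"caption": "A golden retriever sitting on a red velvet couch in a sunlit Victorian living room"},
]

with open("metadata.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```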
### Inference Parameters
| Parameter | Default | Description |
|---|---|---|
| `num_inference_steps` | 30 | Number of denoising steps |
| `true_cfg_scale` | 4.0 | Norm-guided classifier-free guidance scale |
| `resolution` | 1024 | Output image resolution (height = width) |
| `max_sequence_length` | 512 | Maximum prompt token length |
| `batch_size` | 1 | Per-GPU batch size |
## Training Details
### Training Procedure
The adapter was trained using PromptEcho, an annotation-free RL framework:
- RL algorithm: AWM/GRPO-style policy optimization
- Reward signal: Negative cross-entropy loss from a frozen VLM conditioned on the generated image
- Training regime: bf16 mixed precision
- LoRA initialization: Gaussian
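The group-relative advantage used in GRPO-style updates can be sketched as follows. This is a simplified illustration, not the training code: the rewards of the G images generated from one caption are normalized within that group, so no learned value function (critic) is needed:

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantage for the G samples generated from one caption.

    Each image's PromptEcho reward is compared against its group's mean and
    scaled by the group's standard deviation.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: rewards (negative cross-entropies) for G=4 images from one caption.
rewards = torch.tensor([-1.2, -0.8, -2.0, -1.0])
adv = group_advantages(rewards)  # highest-reward image gets the largest advantage
```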
### Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| LoRA rank (r) | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.0 |
| Bias | none |
## Evaluation
**Important:** No benchmark-specific training was performed. The training data consists of ~200-word dense captions generated by Qwen3-VL-32B, which differ significantly in distribution from all evaluation benchmarks below. All improvements reflect genuine generalization of prompt-following ability.
### DenseAlignBench (Ours)
Pairwise evaluation using Gemini-3-flash-preview as VLM judge, with random image-order shuffling to mitigate position bias.
| Comparison | Win Rate | Baseline Win Rate | Tie Rate | Net Advantage |
|---|---|---|---|---|
| QwenImage + PromptEcho vs Baseline | 53.3% | 37.0% | 9.7% | +16.2pp |
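The position-bias mitigation can be sketched as follows: each pair is shown to the judge in a random order, and the verdict is mapped back to the original labels before tallying. The `judge` callable here is a hypothetical stand-in for the VLM judge, returning `"A"`, `"B"`, or `"tie"` for the first and second item it is shown:

```python
import random

def judged_pairwise(pairs, judge, rng=None):
    """Tally win/tie rates with random image-order shuffling.

    pairs: list of (ours, baseline) items.
    judge: judge(first, second) -> "A" | "B" | "tie".
    Returns (ours_win_rate, baseline_win_rate, tie_rate).
    """
    rng = rng or random.Random(0)
    tally = {"ours": 0, "baseline": 0, "tie": 0}
    for ours, baseline in pairs:
        swapped = rng.random() < 0.5          # randomize presentation order
        first, second = (baseline, ours) if swapped else (ours, baseline)
        verdict = judge(first, second)
        if verdict == "tie":
            tally["tie"] += 1
        elif (verdict == "A") != swapped:     # map verdict back to original labels
            tally["ours"] += 1
        else:
            tally["baseline"] += 1
    n = len(pairs)
    return tally["ours"] / n, tally["baseline"] / n, tally["tie"] / n
```

Because a content-based judge is invariant to presentation order, the shuffling changes nothing for an unbiased judge but averages out any left/right preference of a biased one.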
### GenEval
Structured evaluation covering single/two objects, counting, colors, position, and attribute binding.
| Model | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Bind. | Overall |
|---|---|---|---|---|---|---|---|
| QwenImage-2512 (Baseline) | 0.99 | 0.94 | 0.56 | 0.87 | 0.47 | 0.64 | 0.74 |
| + PromptEcho | 0.99 | 0.93 | 0.68 | 0.90 | 0.55 | 0.70 | 0.79 (+5pp) |
Key improvements: Counting +12pp, Position +8pp, Attribute Binding +6pp.
### DPG-Bench
Dense prompt semantic alignment evaluation across global, entity, attribute, relation, and other categories (scored by mPLUG-large VQA).
| Model | Global | Entity | Attribute | Relation | Other | Overall |
|---|---|---|---|---|---|---|
| QwenImage-2512 (Baseline) | 94.40 | 93.27 | 90.01 | 92.82 | 91.34 | 87.32 |
| + PromptEcho | 91.21 | 93.25 | 90.39 | 93.63 | 93.13 | 87.49 |
### TIIFBench
Fine-grained instruction following benchmark covering 40 dimensions across Basic, Advanced, and Real World categories (scored by GPT-4o). Each cell shows short/long description scores.
| Model | Overall (short) | Overall (long) | Basic Avg | Advanced Avg | Real World |
|---|---|---|---|---|---|
| QwenImage-2512 (Baseline) | 84.89 | 83.25 | 85.5/84.8 | 80.3/82.2 | 92.9/93.7 |
| + PromptEcho | 85.50 | 86.46 | 87.1/87.2 | 80.6/85.4 | 96.2/95.0 |
Key improvements: Reasoning +6.1pp (short), Text rendering 95.5→99.1 (short), Real World +3.3pp (short), Overall +3.2pp (long).
## Citation
```bibtex
@article{promptecho2025,
  title={PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning},
  author={PromptEcho Team},
  journal={arXiv preprint arXiv:2604.12652},
  year={2025}
}
```
## Framework Versions
- PEFT: 0.18.0
- Diffusers: >= 0.30.0
- PyTorch: >= 2.0