PromptEcho LoRA for Z-Image

LoRA adapter for Z-Image, fine-tuned with PromptEcho, an annotation-free reinforcement learning method that uses a frozen vision-language model (VLM) to provide reward signals. Instead of relying on human preference labels, PromptEcho scores each generated image by how well the frozen VLM can re-generate the original text prompt from it: the negative cross-entropy of the prompt tokens, conditioned on the image, is a deterministic, single-forward-pass reward that is optimized with GRPO/AWM-style policy updates. The resulting model follows complex, detail-rich text prompts more faithfully.

Model Details

How It Works

  1. The image generation model produces multiple images from a sampled text caption.
  2. Each image is fed into a frozen VLM alongside a fixed query (e.g., "Describe this image in detail."). The VLM computes the log-probability of the caption tokens, yielding the reward: R = -CrossEntropyLoss(caption | image, query).
  3. A group advantage is computed from the per-image rewards and used to update the generation model via AWM/GRPO policy optimization (a minimal sketch follows below).

No human annotations are required at any stage.
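
The reward from step 2 and the group advantage from step 3 can be sketched as follows. This is a minimal illustration rather than the training code: the helper names, the chat-template handling, and the caption-masking logic are assumptions and depend on which VLM and processor are used.

import torch

@torch.no_grad()
def prompt_echo_reward(vlm, processor, image, caption, query="Describe this image in detail."):
    # Image + fixed query as the user turn, the original caption as the assistant turn.
    messages = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": query}]},
        {"role": "assistant", "content": [{"type": "text", "text": caption}]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(vlm.device)

    # Score only the caption tokens (crude mask: assumes the caption is the final token span).
    caption_len = len(processor.tokenizer(caption, add_special_tokens=False)["input_ids"])
    labels = inputs["input_ids"].clone()
    labels[:, :-caption_len] = -100

    # R = -CrossEntropyLoss(caption | image, query), i.e. the mean caption log-probability.
    return -vlm(**inputs, labels=labels).loss.item()

def group_advantages(rewards, eps=1e-6):
    # GRPO-style group normalization over the rewards of one caption's image group.
    r = torch.tensor(rewards)
    return (r - r.mean()) / (r.std() + eps)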

Usage

Requirements

pip install torch diffusers peft accelerate transformers pillow tqdm

Quick Start

import torch
from diffusers import AutoencoderKL
from diffusers.models import ZImageTransformer2DModel
from diffusers.schedulers import FlowMatchEulerDiscreteScheduler
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base_model_path = "Tongyi-MAI/Z-Image"
lora_path = "robotxx/prompt-echo-z-image"
device = "cuda"
dtype = torch.bfloat16

# Load model components
vae = AutoencoderKL.from_pretrained(base_model_path, subfolder="vae", torch_dtype=dtype)
transformer = ZImageTransformer2DModel.from_pretrained(base_model_path, subfolder="transformer", torch_dtype=dtype)
text_encoder = AutoModel.from_pretrained(base_model_path, subfolder="text_encoder", torch_dtype=dtype, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(base_model_path, subfolder="tokenizer", trust_remote_code=True)
scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(base_model_path, subfolder="scheduler")

# Load and merge LoRA
transformer = PeftModel.from_pretrained(transformer, lora_path)
transformer = transformer.merge_and_unload()

# Move to GPU (the VAE is kept in fp32 for numerically stable decoding)
vae.to(device, dtype=torch.float32)
text_encoder.to(device, dtype=dtype)
transformer.to(device, dtype=dtype)
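
From here the components can be assembled into a pipeline for generation. The snippet below is a sketch that relies on the generic DiffusionPipeline loader with component overrides; the exact pipeline class and call signature may differ across diffusers versions, so check the Tongyi-MAI/Z-Image model card for the canonical usage. The prompt is only an example.

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    base_model_path,
    vae=vae,
    transformer=transformer,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    scheduler=scheduler,
    torch_dtype=dtype,
    trust_remote_code=True,
).to(device)

image = pipe(
    prompt="A red bicycle leaning against a blue door in warm golden-hour light",
    num_inference_steps=30,
    guidance_scale=4.0,
    height=1024,
    width=1024,
).images[0]
image.save("sample.png")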

Inference Script

For batch generation with the included inference script:

# Single GPU
python inference/infer_z_image.py \
    --base_model_path Tongyi-MAI/Z-Image \
    --lora_path ./z-image-prompt_echo_lora \
    --caption_jsonl ./metadata.jsonl \
    --output_dir ./output_z_image

# Multi-GPU (8x)
accelerate launch --num_processes 8 inference/infer_z_image.py \
    --base_model_path Tongyi-MAI/Z-Image \
    --lora_path ./z-image-prompt_echo_lora \
    --caption_jsonl ./metadata.jsonl \
    --output_dir ./output_z_image
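
The caption file is plain JSONL with one prompt per line. The field name below is an assumption; check inference/infer_z_image.py for the exact schema it expects.

{"caption": "A cozy reading nook with a green velvet armchair, a stack of hardcover books, and rain streaking the window behind it."}
{"caption": "A macro photograph of a dewdrop on a spider web refracting an orange sunrise."}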

Inference Parameters

| Parameter | Default | Description |
|---|---|---|
| num_inference_steps | 30 | Number of denoising steps |
| guidance_scale | 4.0 | Classifier-free guidance scale |
| resolution | 1024 | Output image resolution (height = width) |
| max_sequence_length | 512 | Maximum prompt token length |
| batch_size | 1 | Per-GPU batch size |
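
Assuming these parameters map one-to-one to CLI flags of the same name (run the script with --help to confirm), a fully spelled-out single-GPU invocation might look like:

python inference/infer_z_image.py \
    --base_model_path Tongyi-MAI/Z-Image \
    --lora_path ./z-image-prompt_echo_lora \
    --caption_jsonl ./metadata.jsonl \
    --output_dir ./output_z_image \
    --num_inference_steps 30 \
    --guidance_scale 4.0 \
    --resolution 1024 \
    --batch_size 1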

Training Details

Training Procedure

The adapter was trained using PromptEcho, an annotation-free RL framework:

  • RL algorithm: AWM/GRPO-style policy optimization
  • Reward signal: Negative cross-entropy loss from a frozen VLM conditioned on the generated image
  • Training regime: bf16 mixed precision
  • LoRA initialization: Gaussian

Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| LoRA rank (r) | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.0 |
| Bias | none |
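
For reference, a peft LoraConfig matching these hyperparameters could be written as below; the target_modules are an assumption, since the actual projection-layer names depend on the Z-Image transformer implementation.

from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                          # LoRA rank
    lora_alpha=128,
    lora_dropout=0.0,
    bias="none",
    init_lora_weights="gaussian",  # Gaussian initialization, as noted above
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumption: check the actual module names
)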

Evaluation

Important: No benchmark-specific training was performed. The training data consists of ~200-word dense captions generated by Qwen3-VL-32B, which differ significantly in distribution from all evaluation benchmarks below. All improvements reflect genuine generalization of prompt-following ability.

DenseAlignBench (Ours)

Pairwise evaluation using Gemini-3-flash-preview as VLM judge, with random image-order shuffling to mitigate position bias.

| Comparison | Win Rate | Baseline Win Rate | Tie Rate | Net Advantage |
|---|---|---|---|---|
| Z-Image + PromptEcho vs Baseline | 61.5% | 34.7% | 3.8% | +26.8pp |

GenEval

Structured evaluation covering single/two objects, counting, colors, position, and attribute binding.

| Model | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Bind. | Overall |
|---|---|---|---|---|---|---|---|
| Z-Image (Baseline) | 0.99 | 0.91 | 0.75 | 0.86 | 0.41 | 0.59 | 0.75 |
| + PromptEcho | 1.00 | 0.94 | 0.85 | 0.86 | 0.52 | 0.73 | 0.82 (+6.5pp) |

Key improvements: Attribute Binding +14pp, Position +11pp, Counting +10pp.

DPG-Bench

Dense prompt semantic alignment evaluation across global, entity, attribute, relation, and other categories (scored by mPLUG-large VQA).

| Model | Global | Entity | Attribute | Relation | Other | Overall |
|---|---|---|---|---|---|---|
| Z-Image (Baseline) | 91.60 | 91.54 | 90.32 | 92.76 | 91.94 | 86.90 |
| + PromptEcho | 93.05 | 92.76 | 91.87 | 93.89 | 89.99 | 87.92 (+1.02) |

TIIFBench

Fine-grained instruction-following benchmark covering 40 dimensions across Basic, Advanced, and Real World categories (scored by GPT-4o). Paired cells show short / long description scores.

| Model | Overall (short) | Overall (long) | Basic Avg (short/long) | Advanced Avg (short/long) | Real World (short/long) |
|---|---|---|---|---|---|
| Z-Image (Baseline) | 84.91 | 83.16 | 86.4 / 85.7 | 79.9 / 79.5 | 91.6 / 90.3 |
| + PromptEcho | 88.50 (+3.6pp) | 88.94 (+5.8pp) | 90.3 / 89.7 | 83.3 / 86.8 | 95.2 / 93.3 |

Key improvements: Text rendering 93.2→99.1 (short), Style 73.3→83.3 (long), Real World +3.6pp (short), Overall +5.8pp (long).

Citation

@article{promptecho2025,
  title={PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning},
  author={PromptEcho Team},
  journal={arXiv preprint arXiv:2604.12652},
  year={2025}
}

Framework Versions

  • PEFT: 0.17.1
  • Diffusers: >= 0.30.0
  • PyTorch: >= 2.0