PromptEcho LoRA for Z-Image

LoRA adapter for Z-Image, fine-tuned with PromptEcho, an annotation-free reinforcement learning method that uses a frozen vision-language model (VLM) to provide reward signals. Instead of relying on human preference labels, PromptEcho scores each generated image by how well the frozen VLM can re-generate the original text prompt from it: the negative cross-entropy of the prompt tokens, conditioned on the image, is a deterministic, single-forward-pass reward that is optimized with GRPO/AWM-style policy updates. The resulting model follows complex, detail-rich text prompts more faithfully.

Model Details

How It Works

  1. The image generation model produces multiple images from a sampled text caption.
  2. Each image is fed into a frozen VLM alongside a fixed query (e.g., "Describe this image in detail."). The VLM computes the log-probability of the caption tokens, yielding the reward: R = -CrossEntropyLoss(caption | image, query).
  3. A group advantage is computed from the per-image rewards and used to update the generation model via AWM/GRPO policy optimization (a minimal sketch follows below).

No human annotations are required at any stage.
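
The reward from step 2 and the group advantage from step 3 can be sketched as follows. This is a minimal illustration rather than the training code: the helper names, the chat-template handling, and the caption-masking logic are assumptions and depend on which VLM and processor are used.

import torch

@torch.no_grad()
def prompt_echo_reward(vlm, processor, image, caption, query="Describe this image in detail."):
    # Image + fixed query as the user turn, the original caption as the assistant turn.
    messages = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": query}]},
        {"role": "assistant", "content": [{"type": "text", "text": caption}]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(vlm.device)

    # Score only the caption tokens (crude mask: assumes the caption is the final token span).
    caption_len = len(processor.tokenizer(caption, add_special_tokens=False)["input_ids"])
    labels = inputs["input_ids"].clone()
    labels[:, :-caption_len] = -100

    # R = -CrossEntropyLoss(caption | image, query), i.e. the mean caption log-probability.
    return -vlm(**inputs, labels=labels).loss.item()

def group_advantages(rewards, eps=1e-6):
    # GRPO-style group normalization over the rewards of one caption's image group.
    r = torch.tensor(rewards)
    return (r - r.mean()) / (r.std() + eps)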

Usage

Requirements

pip install torch diffusers peft accelerate transformers pillow tqdm

Quick Start

import torch
from diffusers import AutoencoderKL
from diffusers.models import ZImageTransformer2DModel
from diffusers.schedulers import FlowMatchEulerDiscreteScheduler
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base_model_path = "Tongyi-MAI/Z-Image"
lora_path = "robotxx/prompt-echo-z-image"
device = "cuda"
dtype = torch.bfloat16

# Load model components
vae = AutoencoderKL.from_pretrained(base_model_path, subfolder="vae", torch_dtype=dtype)
transformer = ZImageTransformer2DModel.from_pretrained(base_model_path, subfolder="transformer", torch_dtype=dtype)
text_encoder = AutoModel.from_pretrained(base_model_path, subfolder="text_encoder", torch_dtype=dtype, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(base_model_path, subfolder="tokenizer", trust_remote_code=True)
scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(base_model_path, subfolder="scheduler")

# Load and merge LoRA
transformer = PeftModel.from_pretrained(transformer, lora_path)
transformer = transformer.merge_and_unload()

# Move to GPU (the VAE is kept in fp32 for numerically stable decoding)
vae.to(device, dtype=torch.float32)
text_encoder.to(device, dtype=dtype)
transformer.to(device, dtype=dtype)
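
From here the components can be assembled into a pipeline for generation. The snippet below is a sketch that relies on the generic DiffusionPipeline loader with component overrides; the exact pipeline class and call signature may differ across diffusers versions, so check the Tongyi-MAI/Z-Image model card for the canonical usage. The prompt is only an example.

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    base_model_path,
    vae=vae,
    transformer=transformer,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    scheduler=scheduler,
    torch_dtype=dtype,
    trust_remote_code=True,
).to(device)

image = pipe(
    prompt="A red bicycle leaning against a blue door in warm golden-hour light",
    num_inference_steps=30,
    guidance_scale=4.0,
    height=1024,
    width=1024,
).images[0]
image.save("sample.png")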

Inference Script

For batch generation with the included inference script:

# Single GPU
python inference/infer_z_image.py \
    --base_model_path Tongyi-MAI/Z-Image \
    --lora_path ./z-image-prompt_echo_lora \
    --caption_jsonl ./metadata.jsonl \
    --output_dir ./output_z_image

# Multi-GPU (8x)
accelerate launch --num_processes 8 inference/infer_z_image.py \
    --base_model_path Tongyi-MAI/Z-Image \
    --lora_path ./z-image-prompt_echo_lora \
    --caption_jsonl ./metadata.jsonl \
    --output_dir ./output_z_image
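
The caption file is plain JSONL with one prompt per line. The field name below is an assumption; check inference/infer_z_image.py for the exact schema it expects.

{"caption": "A cozy reading nook with a green velvet armchair, a stack of hardcover books, and rain streaking the window behind it."}
{"caption": "A macro photograph of a dewdrop on a spider web refracting an orange sunrise."}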

Inference Parameters

| Parameter | Default | Description |
|---|---|---|
| num_inference_steps | 30 | Number of denoising steps |
| guidance_scale | 4.0 | Classifier-free guidance scale |
| resolution | 1024 | Output image resolution (height = width) |
| max_sequence_length | 512 | Maximum prompt token length |
| batch_size | 1 | Per-GPU batch size |
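
Assuming these parameters map one-to-one to CLI flags of the same name (run the script with --help to confirm), a fully spelled-out single-GPU invocation might look like:

python inference/infer_z_image.py \
    --base_model_path Tongyi-MAI/Z-Image \
    --lora_path ./z-image-prompt_echo_lora \
    --caption_jsonl ./metadata.jsonl \
    --output_dir ./output_z_image \
    --num_inference_steps 30 \
    --guidance_scale 4.0 \
    --resolution 1024 \
    --batch_size 1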

Training Details

Training Procedure

The adapter was trained using PromptEcho, an annotation-free RL framework:

  • RL algorithm: AWM/GRPO-style policy optimization
  • Reward signal: Negative cross-entropy loss from a frozen VLM conditioned on the generated image
  • Training regime: bf16 mixed precision
  • LoRA initialization: Gaussian

Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| LoRA rank (r) | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.0 |
| Bias | none |
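
For reference, a peft LoraConfig matching these hyperparameters could be written as below; the target_modules are an assumption, since the actual projection-layer names depend on the Z-Image transformer implementation.

from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                          # LoRA rank
    lora_alpha=128,
    lora_dropout=0.0,
    bias="none",
    init_lora_weights="gaussian",  # Gaussian initialization, as noted above
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumption: check the actual module names
)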

Evaluation

Important: No benchmark-specific training was performed. The training data consists of ~200-word dense captions generated by Qwen3-VL-32B, which differ significantly in distribution from all evaluation benchmarks below. All improvements reflect genuine generalization of prompt-following ability.

DenseAlignBench (Ours)

Pairwise evaluation using Gemini-3-flash-preview as VLM judge, with random image-order shuffling to mitigate position bias.

| Comparison | Win Rate | Baseline Win Rate | Tie Rate | Net Advantage |
|---|---|---|---|---|
| Z-Image + PromptEcho vs Baseline | 61.5% | 34.7% | 3.8% | +26.8pp |

GenEval

Structured evaluation covering single/two objects, counting, colors, position, and attribute binding.

| Model | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Bind. | Overall |
|---|---|---|---|---|---|---|---|
| Z-Image (Baseline) | 0.99 | 0.91 | 0.75 | 0.86 | 0.41 | 0.59 | 0.75 |
| + PromptEcho | 1.00 | 0.94 | 0.85 | 0.86 | 0.52 | 0.73 | 0.82 (+6.5pp) |

Key improvements: Attribute Binding +14pp, Position +11pp, Counting +10pp.

DPG-Bench

Dense prompt semantic alignment evaluation across global, entity, attribute, relation, and other categories (scored by mPLUG-large VQA).

| Model | Global | Entity | Attribute | Relation | Other | Overall |
|---|---|---|---|---|---|---|
| Z-Image (Baseline) | 91.60 | 91.54 | 90.32 | 92.76 | 91.94 | 86.90 |
| + PromptEcho | 93.05 | 92.76 | 91.87 | 93.89 | 89.99 | 87.92 (+1.02) |

TIIFBench

Fine-grained instruction-following benchmark covering 40 dimensions across Basic, Advanced, and Real World categories (scored by GPT-4o). Paired cells show short / long description scores.

| Model | Overall (short) | Overall (long) | Basic Avg (short/long) | Advanced Avg (short/long) | Real World (short/long) |
|---|---|---|---|---|---|
| Z-Image (Baseline) | 84.91 | 83.16 | 86.4 / 85.7 | 79.9 / 79.5 | 91.6 / 90.3 |
| + PromptEcho | 88.50 (+3.6pp) | 88.94 (+5.8pp) | 90.3 / 89.7 | 83.3 / 86.8 | 95.2 / 93.3 |

Key improvements: Text rendering 93.2→99.1 (short), Style 73.3→83.3 (long), Real World +3.6pp (short), Overall +5.8pp (long).

Citation

@article{promptecho2025,
  title={PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning},
  author={PromptEcho Team},
  journal={arXiv preprint arXiv:2604.12652},
  year={2025}
}

Framework Versions

  • PEFT: 0.17.1
  • Diffusers: >= 0.30.0
  • PyTorch: >= 2.0