---
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
license: other
pipeline_tag: image-text-to-text
tags:
- reward-model
- image-editing
- FIRM
- llama-factory
- generated_from_trainer
model-index:
- name: FIRM-Edit-8B
  results: []
---
# FIRM-Edit-8B

Project Page | Paper | GitHub
FIRM-Edit-8B is a robust reward model (critic) designed for faithful image editing. It is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct on the FIRM-Edit-370K dataset. The model is part of the FIRM (Faithful Image Reward Modeling) framework, which provides accurate and reliable guidance for visual reinforcement learning pipelines.
## Model Description
Conventional reward models used for image editing often hallucinate and assign noisy scores, misguiding the optimization process. FIRM-Edit-8B addresses these issues by evaluating edits against two competing objectives:
- Execution: Adherence to the editing instruction.
- Consistency: Preservation of original content in unedited regions.
By formulating a "Consistency-Modulated Execution" (CME) reward strategy, this model acts as a stable critic that mitigates hallucinations and helps establish a new standard for fidelity in image editing.
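The intuition behind a consistency-modulated reward can be sketched in a few lines. The exact CME formula is defined in the FIRM paper and is not reproduced here; the multiplicative gating below is an illustrative assumption, not the model's actual scoring rule.

```python
def cme_reward(execution: float, consistency: float) -> float:
    """Illustrative consistency-modulated execution reward.

    `execution` measures adherence to the editing instruction and
    `consistency` measures preservation of unedited regions; both are
    assumed to lie in [0, 1]. Gating execution by consistency means an
    edit that follows the instruction but damages unrelated content
    still receives a low reward.
    """
    assert 0.0 <= execution <= 1.0 and 0.0 <= consistency <= 1.0
    return execution * consistency

# A faithful edit keeps a high reward; an unfaithful one is penalized.
faithful = cme_reward(execution=0.9, consistency=0.95)   # 0.855
unfaithful = cme_reward(execution=0.9, consistency=0.3)  # 0.27
```

Modulating rather than simply averaging the two scores prevents a high execution score from compensating for collateral damage to the image.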
## Intended Uses & Limitations
- Reward Modeling: To be used as a reward signal in Reinforcement Learning (RL) pipelines for image editing.
- Evaluation: To serve as a metric for benchmarking the performance of image editing models.
## Training procedure
The model was fine-tuned using the LLaMA Factory framework.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 10
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1.0
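With a per-device batch size of 10 and 2 gradient accumulation steps, the effective batch size per optimizer step is 20 (times the number of devices, which is not stated). The learning-rate curve implied by the cosine schedule with a 0.1 warmup ratio can be sketched as follows; the total step count is illustrative, since the trainer derives it from the dataset size.

```python
import math

def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-5, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# LR ramps up over the first 10% of steps, peaks at 1e-5, then decays.
start = lr_at_step(0, 1000)     # 0.0
peak = lr_at_step(100, 1000)    # 1e-5
end = lr_at_step(1000, 1000)    # ~0.0
```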
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 0.591 | 0.2182 | 500 | 0.5827 |
| 0.5605 | 0.4364 | 1000 | 0.5460 |
| 0.5252 | 0.6546 | 1500 | 0.5199 |
| 0.5075 | 0.8728 | 2000 | 0.5055 |
## Usage
To use the model as a reward server for RL training, you can use the script provided in the official repository:
```bash
# Launch the reward server
python editing/reward_server/reward_server_qwen3_vl_8b_sft.py
```
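An RL trainer would then query this server with each candidate edit. The request schema is defined by the repository script; the endpoint and field names below are purely hypothetical assumptions for illustration, so consult the script for the actual API.

```python
import base64
import json

def build_reward_request(instruction: str,
                         source_png: bytes,
                         edited_png: bytes) -> dict:
    """Build an illustrative JSON-serializable reward request.

    Field names ("instruction", "source_image", "edited_image") are
    hypothetical; the real schema lives in the repository's server script.
    """
    return {
        "instruction": instruction,
        "source_image": base64.b64encode(source_png).decode("ascii"),
        "edited_image": base64.b64encode(edited_png).decode("ascii"),
    }

payload = build_reward_request("make the sky pink", b"<src png>", b"<edit png>")
body = json.dumps(payload)  # e.g. POST to the server with an HTTP client
```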
## Citation
If you find this work useful, please cite:
```bibtex
@article{zhao2026trust,
  title={Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation},
  author={Zhao, Xiangyu and Zhang, Peiyuan and Lin, Junming and Liang, Tianhao and Duan, Yuchen and Ding, Shengyuan and Tian, Changyao and Zang, Yuhang and Yan, Junchi and Yang, Xue},
  journal={arXiv preprint arXiv:2603.12247},
  year={2026}
}
```