---
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
license: other
pipeline_tag: image-text-to-text
tags:
  - reward-model
  - image-editing
  - FIRM
  - llama-factory
  - generated_from_trainer
model-index:
  - name: FIRM-Edit-8B
    results: []
---

# FIRM-Edit-8B

Project Page | Paper | GitHub

FIRM-Edit-8B is a robust reward model (critic) designed for faithful image editing. It is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct on the FIRM-Edit-370K dataset. The model is part of the FIRM (Faithful Image Reward Modeling) framework, which provides accurate and reliable guidance for visual reinforcement learning pipelines.

## Model Description

Conventional reward models for image editing often hallucinate and assign noisy scores, misguiding the optimization process. FIRM-Edit-8B addresses these issues by evaluating each edit against two competing objectives:

  1. Execution: Adherence to the editing instruction.
  2. Consistency: Preservation of original content in unedited regions.

By formulating a "Consistency-Modulated Execution" (CME) reward strategy, this model acts as a stable critic that mitigates hallucinations and helps establish a new standard for fidelity in image editing.
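The exact CME formula is defined in the paper rather than in this card, but the idea of modulating the execution score by consistency can be sketched as follows. The multiplicative form and the `gamma` exponent here are illustrative assumptions, not the paper's actual reward function:

```python
def cme_reward(execution: float, consistency: float, gamma: float = 2.0) -> float:
    """Illustrative Consistency-Modulated Execution (CME) score.

    The execution score (instruction adherence) is down-weighted by a
    consistency factor (preservation of unedited regions), so an edit
    that follows the instruction while corrupting the rest of the image
    cannot receive a high reward. The product form and `gamma` are
    assumptions for illustration only.
    """
    assert 0.0 <= execution <= 1.0 and 0.0 <= consistency <= 1.0
    return execution * consistency ** gamma


# A faithful edit keeps its full execution score...
print(cme_reward(0.9, 1.0))  # 0.9
# ...while the same edit with poor consistency is heavily penalized.
print(cme_reward(0.9, 0.5))  # 0.225
```

The key property is that a high execution score alone is never sufficient: the consistency term gates the reward, which is what makes the critic robust to "over-editing" hallucinations.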

## Intended Uses & Limitations

  • Reward Modeling: To be used as a reward signal in Reinforcement Learning (RL) pipelines for image editing.
  • Evaluation: To serve as a metric for benchmarking the performance of image editing models.

## Training procedure

The model was fine-tuned using the LLaMA Factory framework.

### Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 10
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 2
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1.0
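The settings above imply an effective per-device batch size of 10 × 2 = 20 and a cosine learning-rate schedule with a linear warmup over the first 10% of steps. A minimal sketch of that schedule (standard cosine-with-warmup, assuming single-cycle decay to zero; the total step count is taken from the results table below):

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float = 1e-5,
          warmup_ratio: float = 0.1) -> float:
    """Cosine LR schedule with linear warmup, matching the listed settings."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Single-cycle cosine decay from base_lr to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Effective batch size = train_batch_size * gradient_accumulation_steps
effective_batch = 10 * 2  # = 20 per device (times the device count, which is not listed)
```

Note that the true global batch size also depends on the number of devices used, which is not stated in this card.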

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.591         | 0.2182 | 500  | 0.5827          |
| 0.5605        | 0.4364 | 1000 | 0.5460          |
| 0.5252        | 0.6546 | 1500 | 0.5199          |
| 0.5075        | 0.8728 | 2000 | 0.5055          |

## Usage

To use the model as a reward server for RL training, you can use the script provided in the official repository:

```bash
# Launch the reward server
python editing/reward_server/reward_server_qwen3_vl_8b_sft.py
```
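A client in the RL training loop would then send the instruction plus the source and edited images to the server and receive a scalar reward back. The request format below is a hypothetical sketch — the field names (`instruction`, `source_image`, `edited_image`) are assumptions for illustration; consult the reward-server script in the official repository for the actual API:

```python
import base64
import json


def build_reward_request(instruction: str, source_png: bytes,
                         edited_png: bytes) -> str:
    """Build an illustrative JSON payload for a reward-server request.

    All field names here are hypothetical; the real server may expect a
    different schema. Images are base64-encoded so they can travel in JSON.
    """
    payload = {
        "instruction": instruction,
        "source_image": base64.b64encode(source_png).decode("ascii"),
        "edited_image": base64.b64encode(edited_png).decode("ascii"),
    }
    return json.dumps(payload)


# Example: serialize a request for a hypothetical "make the sky red" edit.
body = build_reward_request("make the sky red", b"\x89PNG...", b"\x89PNG...")
```

The returned reward would then be fed to the policy-gradient update as the scalar critic signal described above.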

## Citation

If you find this work useful, please cite:

```bibtex
@article{zhao2026trust,
  title={Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation},
  author={Zhao, Xiangyu and Zhang, Peiyuan and Lin, Junming and Liang, Tianhao and Duan, Yuchen and Ding, Shengyuan and Tian, Changyao and Zang, Yuhang and Yan, Junchi and Yang, Xue},
  journal={arXiv preprint arXiv:2603.12247},
  year={2026}
}
```