---
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
license: other
pipeline_tag: image-text-to-text
tags:
- reward-model
- image-editing
- FIRM
- llama-factory
- generated_from_trainer
model-index:
- name: FIRM-Edit-8B
  results: []
---

# FIRM-Edit-8B

[**Project Page**](https://firm-reward.github.io/) | [**Paper**](https://arxiv.org/abs/2603.12247) | [**GitHub**](https://github.com/VisionXLab/FIRM-Reward)

**FIRM-Edit-8B** is a robust reward model (critic) designed for faithful image editing. It is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) on the **FIRM-Edit-370K** dataset. The model is part of the **FIRM (Faithful Image Reward Modeling)** framework, which provides accurate and reliable guidance for visual reinforcement learning pipelines.

## Model Description

Conventional reward models used for image editing often suffer from hallucinations and assign noisy scores, misguiding the optimization process. FIRM-Edit-8B addresses these issues by evaluating edits through two competing objectives:

1. **Execution**: adherence to the editing instruction.
2. **Consistency**: preservation of original content in unedited regions.

By formulating a "Consistency-Modulated Execution" (CME) reward strategy, this model acts as a stable critic that mitigates hallucinations and helps establish a new standard for fidelity in image editing.

## Intended Uses & Limitations

- **Reward modeling**: to be used as a reward signal in reinforcement learning (RL) pipelines for image editing.
- **Evaluation**: to serve as a metric for benchmarking the performance of image editing models.

## Training procedure

The model was fine-tuned using the [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) framework.
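The Consistency-Modulated Execution idea described above can be illustrated with a toy scalar combination, where the execution score is gated by the consistency score. This is a minimal sketch for intuition only; the exact formulation used by FIRM is defined in the paper, and the function below is an illustrative assumption, not the model's implementation.

```python
def cme_reward(execution: float, consistency: float) -> float:
    """Toy Consistency-Modulated Execution (CME) reward.

    Assumes `execution` (instruction adherence) and `consistency`
    (preservation of unedited regions) are scores in [0, 1]. Gating
    execution by consistency suppresses the reward for edits that
    follow the instruction but corrupt regions that should be left
    untouched. This exact combination is an illustrative assumption,
    not the paper's formula.
    """
    return execution * consistency


# An edit that executes well but destroys unedited content is
# penalized relative to a balanced, faithful edit.
print(cme_reward(0.9, 0.1))  # aggressive, unfaithful edit
print(cme_reward(0.7, 0.9))  # faithful edit scores higher
```

A multiplicative gate is only one choice; any modulation that makes the execution reward conditional on consistency serves the same purpose of stabilizing the critic against reward hacking.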
### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 1e-05
- train_batch_size: 10
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1.0

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.591         | 0.2182 | 500  | 0.5827          |
| 0.5605        | 0.4364 | 1000 | 0.5460          |
| 0.5252        | 0.6546 | 1500 | 0.5199          |
| 0.5075        | 0.8728 | 2000 | 0.5055          |

## Usage

To use the model as a reward server for RL training, you can use the script provided in the official repository:

```bash
# Launch the reward server
python editing/reward_server/reward_server_qwen3_vl_8b_sft.py
```

## Citation

If you find this work useful, please cite:

```bibtex
@article{zhao2026trust,
  title={Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation},
  author={Zhao, Xiangyu and Zhang, Peiyuan and Lin, Junming and Liang, Tianhao and Duan, Yuchen and Ding, Shengyuan and Tian, Changyao and Zang, Yuhang and Yan, Junchi and Yang, Xue},
  journal={arXiv preprint arXiv:2603.12247},
  year={2026}
}
```
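Beyond the reward server, the checkpoint can in principle be queried directly through `transformers`, since it is a Qwen3-VL fine-tune. The snippet below is a minimal sketch: the repo id, the prompt wording, and the idea of pairing the original and edited images with the instruction are assumptions for illustration, not the repository's exact scoring interface (see the official reward-server script for that).

```python
def build_messages(original_path: str, edited_path: str, instruction: str):
    """Build a chat-style payload pairing the original and edited images
    with the editing instruction. The prompt text is an illustrative
    assumption, not the exact prompt used by the FIRM reward server."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": original_path},
            {"type": "image", "image": edited_path},
            {"type": "text",
             "text": f"Instruction: {instruction}\n"
                     "Rate how faithfully the second image applies the "
                     "instruction to the first image while preserving "
                     "the unedited regions."},
        ],
    }]


def critique_edit(original_path: str, edited_path: str, instruction: str,
                  model_id: str = "VisionXLab/FIRM-Edit-8B"):  # assumed repo id
    """Load the checkpoint and generate a critique for a single edit.
    Downloads the ~8B model, so it is defined here but not run at import."""
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, device_map="auto")
    inputs = processor.apply_chat_template(
        build_messages(original_path, edited_path, instruction),
        add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    # Decode only the newly generated tokens.
    return processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:],
        skip_special_tokens=True)[0]
```

For RL training, the reward server above remains the supported entry point; this direct-loading path is mainly useful for spot-checking individual edits.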