---
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
license: other
pipeline_tag: image-text-to-text
tags:
- reward-model
- image-editing
- FIRM
- llama-factory
- generated_from_trainer
model-index:
- name: FIRM-Edit-8B
results: []
---
# FIRM-Edit-8B
[**Project Page**](https://firm-reward.github.io/) | [**Paper**](https://arxiv.org/abs/2603.12247) | [**GitHub**](https://github.com/VisionXLab/FIRM-Reward)
**FIRM-Edit-8B** is a robust reward model (critic) designed for faithful image editing. It is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) on the **FIRM-Edit-370K** dataset. The model is part of the **FIRM (Faithful Image Reward Modeling)** framework, which provides accurate and reliable guidance for visual reinforcement learning pipelines.
## Model Description
Conventional reward models used for image editing often suffer from hallucinations and assign noisy scores, misguiding the optimization process. FIRM-Edit-8B addresses these issues by evaluating edits through two competing objectives:
1. **Execution**: Adherence to the editing instruction.
2. **Consistency**: Preservation of original content in unedited regions.
Through its "Consistency-Modulated Execution" (CME) reward strategy, the model acts as a stable critic that mitigates hallucinations and aims to set a higher standard for fidelity in image editing.
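The exact CME formulation is defined in the FIRM paper, not in this card; the sketch below is only an illustrative interpretation, assuming both scores lie in [0, 1] and that the execution reward is gated by the consistency score (the function name and threshold are hypothetical):

```python
# Illustrative sketch only: the precise CME formula is specified in the FIRM
# paper. Here we assume execution and consistency scores in [0, 1] and
# down-weight the execution reward when consistency is low.

def cme_reward(execution: float, consistency: float, threshold: float = 0.5) -> float:
    """Consistency-modulated execution reward (hypothetical form).

    An edit only earns its execution score to the extent that it preserves
    unedited content; low consistency suppresses the reward.
    """
    if not (0.0 <= execution <= 1.0 and 0.0 <= consistency <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    # Soft gate: consistency below the threshold scales the reward toward zero.
    gate = min(1.0, consistency / threshold) if threshold > 0 else 1.0
    return execution * gate
```

The intent is that an edit which follows the instruction but damages unedited regions cannot collect the full execution reward, which is what keeps the critic from rewarding hallucinated changes.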
## Intended Uses & Limitations
- **Reward Modeling**: To be used as a reward signal in Reinforcement Learning (RL) pipelines for image editing.
- **Evaluation**: To serve as a metric for benchmarking the performance of image editing models.
## Training Procedure
The model was fine-tuned using the [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) framework.
### Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 10
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1.0
### Training Results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.591 | 0.2182 | 500 | 0.5827 |
| 0.5605 | 0.4364 | 1000 | 0.5460 |
| 0.5252 | 0.6546 | 1500 | 0.5199 |
| 0.5075 | 0.8728 | 2000 | 0.5055 |
## Usage
To use the model as a reward server for RL training, you can use the script provided in the official repository:
```bash
# Launch the reward server
python editing/reward_server/reward_server_qwen3_vl_8b_sft.py
```
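Once the server is running, an RL trainer would typically query it over HTTP with an instruction and an image pair. The actual request schema is defined by `reward_server_qwen3_vl_8b_sft.py` in the official repository; the endpoint path and field names below are assumptions for illustration only:

```python
# Hypothetical client sketch: the real schema is set by the reward server
# script in the FIRM repository. "source_image"/"edited_image" field names
# and the /reward endpoint are assumptions, not the documented API.
import base64

def build_reward_request(instruction: str, source_png: bytes, edited_png: bytes) -> dict:
    """Package an instruction and an edit pair as a JSON-serializable payload."""
    return {
        "instruction": instruction,
        "source_image": base64.b64encode(source_png).decode("ascii"),
        "edited_image": base64.b64encode(edited_png).decode("ascii"),
    }

# e.g. requests.post("http://localhost:8000/reward", json=build_reward_request(...))
```

Consult the server script for the actual port, route, and response format before wiring this into a training loop.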
## Citation
If you find this work useful, please cite:
```bibtex
@article{zhao2026trust,
title={Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation},
author={Zhao, Xiangyu and Zhang, Peiyuan and Lin, Junming and Liang, Tianhao and Duan, Yuchen and Ding, Shengyuan and Tian, Changyao and Zang, Yuhang and Yan, Junchi and Yang, Xue},
journal={arXiv preprint arXiv:2603.12247},
year={2026}
}
```