---
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
license: other
pipeline_tag: image-text-to-text
tags:
- reward-model
- image-editing
- FIRM
- llama-factory
- generated_from_trainer
model-index:
- name: FIRM-Edit-8B
  results: []
---

# FIRM-Edit-8B

[**Project Page**](https://firm-reward.github.io/) | [**Paper**](https://arxiv.org/abs/2603.12247) | [**GitHub**](https://github.com/VisionXLab/FIRM-Reward)

**FIRM-Edit-8B** is a robust reward model (critic) designed for faithful image editing. It is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) on the **FIRM-Edit-370K** dataset. The model is part of the **FIRM (Faithful Image Reward Modeling)** framework, which provides accurate and reliable guidance for visual reinforcement learning pipelines.

## Model Description

Conventional reward models used for image editing often suffer from hallucinations and assign noisy scores, misguiding the optimization process. FIRM-Edit-8B addresses these issues by evaluating edits through two competing objectives:

1. **Execution**: adherence to the editing instruction.
2. **Consistency**: preservation of original content in unedited regions.

By formulating a "Consistency-Modulated Execution" (CME) reward strategy, this model acts as a stable critic that mitigates hallucinations and helps establish a new standard for fidelity in image editing.
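
The paper defines the exact CME formulation; purely as an illustrative sketch of the idea (the function, the multiplicative gating, and the `gamma` exponent below are assumptions, not FIRM's actual formula), consistency can be thought of as a gate on the execution score:

```python
def cme_reward(execution_score: float, consistency_score: float,
               gamma: float = 1.0) -> float:
    """Hypothetical Consistency-Modulated Execution (CME) reward.

    execution_score: instruction adherence, assumed in [0, 1].
    consistency_score: preservation of unedited regions, assumed in [0, 1].
    gamma: assumed modulation exponent (not from the paper).

    The multiplicative gating here is an illustrative assumption; see the
    FIRM paper and repository for the actual formulation.
    """
    return execution_score * (consistency_score ** gamma)

# A well-executed edit that damages unedited regions scores low,
# while an edit satisfying both objectives scores high.
faithful = cme_reward(0.9, 0.95)
unfaithful = cme_reward(0.9, 0.30)
```

Under this sketch, an edit cannot earn a high reward on execution alone, which is the property that makes the critic robust against reward hacking via content destruction.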

## Intended Uses & Limitations

- **Reward Modeling**: to be used as a reward signal in Reinforcement Learning (RL) pipelines for image editing.
- **Evaluation**: to serve as a metric for benchmarking the performance of image editing models.

## Training Procedure

The model was fine-tuned using the [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) framework.
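
LLaMA Factory runs of this kind are typically driven by a YAML config passed to `llamafactory-cli train`. The sketch below is an assumed reconstruction that maps the hyperparameters listed in this card onto LLaMA Factory's standard config keys; the dataset name, finetuning type, and output path are placeholders, not taken from the official repository:

```yaml
# Hypothetical LLaMA Factory config — not the official training recipe.
model_name_or_path: Qwen/Qwen3-VL-8B-Instruct
stage: sft
do_train: true
finetuning_type: full            # assumption; could also be lora
dataset: firm_edit_370k          # placeholder dataset name
per_device_train_batch_size: 10
per_device_eval_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1
num_train_epochs: 1.0
seed: 42
output_dir: saves/firm-edit-8b   # placeholder
```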

### Training Hyperparameters

The following hyperparameters were used during training:

- learning_rate: 1e-05
- train_batch_size: 10
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1.0
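
With `lr_scheduler_warmup_ratio: 0.1` and a cosine scheduler, the per-step learning rate follows the standard warmup-then-cosine curve. A generic sketch of that schedule shape (this mirrors the common convention, not LLaMA Factory's exact implementation):

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 1e-05,
               warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr over the first 10% of steps,
    then cosine decay toward zero for the remaining steps."""
    warmup_steps = max(int(total_steps * warmup_ratio), 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The rate rises linearly to 1e-05 over the first 10% of training, then decays smoothly to zero by the final step.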

### Training Results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.591         | 0.2182 | 500  | 0.5827          |
| 0.5605        | 0.4364 | 1000 | 0.5460          |
| 0.5252        | 0.6546 | 1500 | 0.5199          |
| 0.5075        | 0.8728 | 2000 | 0.5055          |

## Usage

To use the model as a reward server for RL training, you can use the script provided in the official repository:

```bash
# Launch the reward server
python editing/reward_server/reward_server_qwen3_vl_8b_sft.py
```
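
Once the server is running, an RL trainer can query it over HTTP. The host, port, endpoint path, and payload field names below are assumptions for illustration; check `reward_server_qwen3_vl_8b_sft.py` in the repository for the actual interface:

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8000/reward"  # assumed host, port, and path

def build_payload(instruction: str, source_b64: str, edited_b64: str) -> dict:
    # Field names are assumptions, not the server's documented schema.
    return {
        "instruction": instruction,
        "source_image": source_b64,
        "edited_image": edited_b64,
    }

def query_reward(payload: dict) -> dict:
    """POST the payload to the reward server and return its JSON reply."""
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```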

## Citation

If you find this work useful, please cite:

```bibtex
@article{zhao2026trust,
  title={Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation},
  author={Zhao, Xiangyu and Zhang, Peiyuan and Lin, Junming and Liang, Tianhao and Duan, Yuchen and Ding, Shengyuan and Tian, Changyao and Zang, Yuhang and Yan, Junchi and Yang, Xue},
  journal={arXiv preprint arXiv:2603.12247},
  year={2026}
}
```