---
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
license: other
pipeline_tag: image-text-to-text
tags:
- reward-model
- image-editing
- FIRM
- llama-factory
- generated_from_trainer
model-index:
- name: FIRM-Edit-8B
  results: []
---

# FIRM-Edit-8B

[**Project Page**](https://firm-reward.github.io/) | [**Paper**](https://arxiv.org/abs/2603.12247) | [**GitHub**](https://github.com/VisionXLab/FIRM-Reward)

**FIRM-Edit-8B** is a robust reward model (critic) designed for faithful image editing. It is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) on the **FIRM-Edit-370K** dataset. The model is part of the **FIRM (Faithful Image Reward Modeling)** framework, which provides accurate and reliable guidance for visual reinforcement learning pipelines.

## Model Description

Conventional reward models used for image editing often suffer from hallucinations and assign noisy scores, misguiding the optimization process. FIRM-Edit-8B addresses these issues by evaluating edits through two competing objectives:

1. **Execution**: adherence to the editing instruction.
2. **Consistency**: preservation of original content in unedited regions.

By formulating a "Consistency-Modulated Execution" (CME) reward strategy, this model acts as a stable critic that mitigates hallucinations and helps establish a new standard for fidelity in image editing.

## Intended Uses & Limitations

- **Reward modeling**: to be used as a reward signal in reinforcement learning (RL) pipelines for image editing.
- **Evaluation**: to serve as a metric for benchmarking the performance of image editing models.

## Training procedure

The model was fine-tuned using the [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) framework.
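The Consistency-Modulated Execution idea described above can be illustrated with a toy scalar combination, where the execution score is gated by the consistency score. This is a minimal sketch for intuition only; the exact formulation used by FIRM is defined in the paper, and the function below is an illustrative assumption, not the model's implementation.

```python
def cme_reward(execution: float, consistency: float) -> float:
    """Toy Consistency-Modulated Execution (CME) reward.

    Assumes `execution` (instruction adherence) and `consistency`
    (preservation of unedited regions) are scores in [0, 1]. Gating
    execution by consistency suppresses the reward for edits that
    follow the instruction but corrupt regions that should be left
    untouched. This exact combination is an illustrative assumption,
    not the paper's formula.
    """
    return execution * consistency


# An edit that executes well but destroys unedited content is
# penalized relative to a balanced, faithful edit.
print(cme_reward(0.9, 0.1))  # aggressive, unfaithful edit
print(cme_reward(0.7, 0.9))  # faithful edit scores higher
```

A multiplicative gate is only one choice; any modulation that makes the execution reward conditional on consistency serves the same purpose of stabilizing the critic against reward hacking.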
### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 1e-05
- train_batch_size: 10
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1.0

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.591         | 0.2182 | 500  | 0.5827          |
| 0.5605        | 0.4364 | 1000 | 0.5460          |
| 0.5252        | 0.6546 | 1500 | 0.5199          |
| 0.5075        | 0.8728 | 2000 | 0.5055          |

## Usage

To use the model as a reward server for RL training, you can use the script provided in the official repository:

```bash
# Launch the reward server
python editing/reward_server/reward_server_qwen3_vl_8b_sft.py
```

## Citation

If you find this work useful, please cite:

```bibtex
@article{zhao2026trust,
  title={Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation},
  author={Zhao, Xiangyu and Zhang, Peiyuan and Lin, Junming and Liang, Tianhao and Duan, Yuchen and Ding, Shengyuan and Tian, Changyao and Zang, Yuhang and Yan, Junchi and Yang, Xue},
  journal={arXiv preprint arXiv:2603.12247},
  year={2026}
}
```
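Beyond the reward server, the checkpoint can in principle be queried directly through `transformers`, since it is a Qwen3-VL fine-tune. The snippet below is a minimal sketch: the repo id, the prompt wording, and the idea of pairing the original and edited images with the instruction are assumptions for illustration, not the repository's exact scoring interface (see the official reward-server script for that).

```python
def build_messages(original_path: str, edited_path: str, instruction: str):
    """Build a chat-style payload pairing the original and edited images
    with the editing instruction. The prompt text is an illustrative
    assumption, not the exact prompt used by the FIRM reward server."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": original_path},
            {"type": "image", "image": edited_path},
            {"type": "text",
             "text": f"Instruction: {instruction}\n"
                     "Rate how faithfully the second image applies the "
                     "instruction to the first image while preserving "
                     "the unedited regions."},
        ],
    }]


def critique_edit(original_path: str, edited_path: str, instruction: str,
                  model_id: str = "VisionXLab/FIRM-Edit-8B"):  # assumed repo id
    """Load the checkpoint and generate a critique for a single edit.
    Downloads the ~8B model, so it is defined here but not run at import."""
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, device_map="auto")
    inputs = processor.apply_chat_template(
        build_messages(original_path, edited_path, instruction),
        add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    # Decode only the newly generated tokens.
    return processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:],
        skip_special_tokens=True)[0]
```

For RL training, the reward server above remains the supported entry point; this direct-loading path is mainly useful for spot-checking individual edits.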