---
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
license: other
pipeline_tag: image-text-to-text
tags:
- reward-model
- image-editing
- FIRM
- llama-factory
- generated_from_trainer
model-index:
- name: FIRM-Edit-8B
  results: []
---

# FIRM-Edit-8B

[**Project Page**](https://firm-reward.github.io/) | [**Paper**](https://arxiv.org/abs/2603.12247) | [**GitHub**](https://github.com/VisionXLab/FIRM-Reward)

**FIRM-Edit-8B** is a robust reward model (critic) designed for faithful image editing. It is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) on the **FIRM-Edit-370K** dataset. The model is part of the **FIRM (Faithful Image Reward Modeling)** framework, which provides accurate and reliable guidance for visual reinforcement learning pipelines.

## Model Description

Conventional reward models used for image editing often suffer from hallucinations and assign noisy scores, misguiding the optimization process. FIRM-Edit-8B addresses these issues by evaluating edits through two competing objectives:

1. **Execution**: adherence to the editing instruction.
2. **Consistency**: preservation of original content in unedited regions.

By formulating a "Consistency-Modulated Execution" (CME) reward strategy, this model acts as a stable critic that mitigates hallucinations and helps establish a new standard for fidelity in image editing.
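
The paper defines the exact CME formulation; purely as an illustrative sketch of the idea (the function, the multiplicative gating, and the `gamma` exponent below are assumptions, not FIRM's actual formula), consistency can be thought of as a gate on the execution score:

```python
def cme_reward(execution_score: float, consistency_score: float,
               gamma: float = 1.0) -> float:
    """Hypothetical Consistency-Modulated Execution (CME) reward.

    execution_score: instruction adherence, assumed in [0, 1].
    consistency_score: preservation of unedited regions, assumed in [0, 1].
    gamma: assumed modulation exponent (not from the paper).

    The multiplicative gating here is an illustrative assumption; see the
    FIRM paper and repository for the actual formulation.
    """
    return execution_score * (consistency_score ** gamma)

# A well-executed edit that damages unedited regions scores low,
# while an edit satisfying both objectives scores high.
faithful = cme_reward(0.9, 0.95)
unfaithful = cme_reward(0.9, 0.30)
```

Under this sketch, an edit cannot earn a high reward on execution alone, which is the property that makes the critic robust against reward hacking via content destruction.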

## Intended Uses & Limitations

- **Reward Modeling**: to be used as a reward signal in Reinforcement Learning (RL) pipelines for image editing.
- **Evaluation**: to serve as a metric for benchmarking the performance of image editing models.

## Training Procedure

The model was fine-tuned using the [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) framework.
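
LLaMA Factory runs of this kind are typically driven by a YAML config passed to `llamafactory-cli train`. The sketch below is an assumed reconstruction that maps the hyperparameters listed in this card onto LLaMA Factory's standard config keys; the dataset name, finetuning type, and output path are placeholders, not taken from the official repository:

```yaml
# Hypothetical LLaMA Factory config — not the official training recipe.
model_name_or_path: Qwen/Qwen3-VL-8B-Instruct
stage: sft
do_train: true
finetuning_type: full            # assumption; could also be lora
dataset: firm_edit_370k          # placeholder dataset name
per_device_train_batch_size: 10
per_device_eval_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1
num_train_epochs: 1.0
seed: 42
output_dir: saves/firm-edit-8b   # placeholder
```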

### Training Hyperparameters

The following hyperparameters were used during training:

- learning_rate: 1e-05
- train_batch_size: 10
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1.0
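
With `lr_scheduler_warmup_ratio: 0.1` and a cosine scheduler, the per-step learning rate follows the standard warmup-then-cosine curve. A generic sketch of that schedule shape (this mirrors the common convention, not LLaMA Factory's exact implementation):

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 1e-05,
               warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr over the first 10% of steps,
    then cosine decay toward zero for the remaining steps."""
    warmup_steps = max(int(total_steps * warmup_ratio), 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The rate rises linearly to 1e-05 over the first 10% of training, then decays smoothly to zero by the final step.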

### Training Results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.591         | 0.2182 | 500  | 0.5827          |
| 0.5605        | 0.4364 | 1000 | 0.5460          |
| 0.5252        | 0.6546 | 1500 | 0.5199          |
| 0.5075        | 0.8728 | 2000 | 0.5055          |

## Usage

To use the model as a reward server for RL training, you can use the script provided in the official repository:

```bash
# Launch the reward server
python editing/reward_server/reward_server_qwen3_vl_8b_sft.py
```
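
Once the server is running, an RL trainer can query it over HTTP. The host, port, endpoint path, and payload field names below are assumptions for illustration; check `reward_server_qwen3_vl_8b_sft.py` in the repository for the actual interface:

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8000/reward"  # assumed host, port, and path

def build_payload(instruction: str, source_b64: str, edited_b64: str) -> dict:
    # Field names are assumptions, not the server's documented schema.
    return {
        "instruction": instruction,
        "source_image": source_b64,
        "edited_image": edited_b64,
    }

def query_reward(payload: dict) -> dict:
    """POST the payload to the reward server and return its JSON reply."""
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```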

## Citation

If you find this work useful, please cite:

```bibtex
@article{zhao2026trust,
  title={Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation},
  author={Zhao, Xiangyu and Zhang, Peiyuan and Lin, Junming and Liang, Tianhao and Duan, Yuchen and Ding, Shengyuan and Tian, Changyao and Zang, Yuhang and Yan, Junchi and Yang, Xue},
  journal={arXiv preprint arXiv:2603.12247},
  year={2026}
}
```