nielsr (HF Staff) committed
Commit fd04209 · verified · 1 Parent(s): 0e7ffe6

Improve model card for FIRM-Edit-8B


Hi! I'm Niels, part of the community science team at Hugging Face.

I've updated your model card to document this artifact as **FIRM-Edit-8B**, the robust reward model for image editing introduced in your recent paper.

Changes include:
- Added `pipeline_tag: image-text-to-text` to the metadata for better discoverability.
- Linked the model to the arXiv paper, project page, and GitHub repository.
- Added a model description explaining its role as a critic in the FIRM framework.
- Included the citation for your research.
- Retained your original training logs and hyperparameter information.

This will help researchers and practitioners find and use your reward model for reinforcement learning and evaluation in image editing.

Files changed (1)

  1. README.md +38 -26
README.md CHANGED

````diff
@@ -1,39 +1,42 @@
 ---
+base_model: Qwen/Qwen3-VL-8B-Instruct
 library_name: transformers
 license: other
-base_model: Qwen/Qwen3-VL-8B-Instruct
+pipeline_tag: image-text-to-text
 tags:
+- reward-model
+- image-editing
+- FIRM
 - llama-factory
-- full
 - generated_from_trainer
 model-index:
-- name: edit_evaluation_sft_202602030104
+- name: FIRM-Edit-8B
   results: []
 ---
 
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-
-# edit_evaluation_sft_202602030104
-
-This model is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) on the instruction_following_train_v3 and the consistency_train_v3 datasets.
-It achieves the following results on the evaluation set:
-- Loss: 0.5041
+# FIRM-Edit-8B
 
-## Model description
+[**Project Page**](https://firm-reward.github.io/) | [**Paper**](https://arxiv.org/abs/2603.12247) | [**GitHub**](https://github.com/VisionXLab/FIRM-Reward)
 
-More information needed
+**FIRM-Edit-8B** is a robust reward model (critic) designed for faithful image editing. It is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) on the **FIRM-Edit-370K** dataset. The model is part of the **FIRM (Faithful Image Reward Modeling)** framework, which provides accurate and reliable guidance for visual reinforcement learning pipelines.
 
-## Intended uses & limitations
+## Model Description
 
-More information needed
+Conventional reward models used for image editing often suffer from hallucinations and assign noisy scores, misguiding the optimization process. FIRM-Edit-8B addresses these issues by evaluating edits through two competing objectives:
+1. **Execution**: Adherence to the editing instruction.
+2. **Consistency**: Preservation of original content in unedited regions.
 
-## Training and evaluation data
+By formulating a "Consistency-Modulated Execution" (CME) reward strategy, this model acts as a stable critic that mitigates hallucinations and helps establish a new standard for fidelity in image editing.
 
-More information needed
+## Intended Uses & Limitations
+
+- **Reward Modeling**: To be used as a reward signal in Reinforcement Learning (RL) pipelines for image editing.
+- **Evaluation**: To serve as a metric for benchmarking the performance of image editing models.
 
 ## Training procedure
 
+The model was fine-tuned using the [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) framework.
+
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
@@ -41,12 +44,7 @@ The following hyperparameters were used during training:
 - train_batch_size: 10
 - eval_batch_size: 2
 - seed: 42
-- distributed_type: multi-GPU
-- num_devices: 8
 - gradient_accumulation_steps: 2
-- total_train_batch_size: 160
-- total_eval_batch_size: 16
-- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
 - lr_scheduler_type: cosine
 - lr_scheduler_warmup_ratio: 0.1
 - num_epochs: 1.0
@@ -60,10 +58,24 @@ The following hyperparameters were used during training:
 | 0.5252 | 0.6546 | 1500 | 0.5199 |
 | 0.5075 | 0.8728 | 2000 | 0.5055 |
 
-### Framework versions
-
-- Transformers 4.57.3
-- Pytorch 2.7.1+cu128
-- Datasets 4.0.0
-- Tokenizers 0.22.2
+## Usage
+
+To use the model as a reward server for RL training, you can use the script provided in the official repository:
+
+```bash
+# Launch the reward server
+python editing/reward_server/reward_server_qwen3_vl_8b_sft.py
+```
+
+## Citation
+
+If you find this work useful, please cite:
+
+```bibtex
+@article{zhao2026trust,
+  title={Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation},
+  author={Zhao, Xiangyu and Zhang, Peiyuan and Lin, Junming and Liang, Tianhao and Duan, Yuchen and Ding, Shengyuan and Tian, Changyao and Zang, Yuhang and Yan, Junchi and Yang, Xue},
+  journal={arXiv preprint arXiv:2603.12247},
+  year={2026}
+}
+```
````
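
As an aside for readers of the card: the "Consistency-Modulated Execution" (CME) idea described above — an execution score gated by a consistency score — can be illustrated with a small sketch. The actual CME formulation is defined in the paper; the function name, the multiplicative gating form, and the `gamma` exponent below are assumptions chosen purely for illustration.

```python
def cme_reward(execution: float, consistency: float, gamma: float = 2.0) -> float:
    """Illustrative consistency-modulated execution reward (NOT the paper's formula).

    `execution` scores adherence to the edit instruction; `consistency` scores
    preservation of unedited regions (both assumed in [0, 1]). Gating execution
    by consistency drives the reward toward zero when the edit destroys content
    that should have been preserved, no matter how well the instruction was followed.
    """
    return execution * (consistency ** gamma)


# A faithful edit retains most of the execution reward,
# while an edit that wrecks unedited regions is penalized heavily.
good_edit = cme_reward(0.9, 0.95)
bad_edit = cme_reward(0.9, 0.2)
```

The point of the gating shape is that a high execution score alone cannot compensate for poor consistency, which is exactly the failure mode (reward hacking via unfaithful edits) the model description says CME is meant to prevent.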