Instructions for using VisionXLab/FIRM-Gen-8B with libraries and local apps.

Libraries

How to use VisionXLab/FIRM-Gen-8B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="VisionXLab/FIRM-Gen-8B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Print the result so it is visible when run as a script
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("VisionXLab/FIRM-Gen-8B")
model = AutoModelForImageTextToText.from_pretrained("VisionXLab/FIRM-Gen-8B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use VisionXLab/FIRM-Gen-8B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "VisionXLab/FIRM-Gen-8B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "VisionXLab/FIRM-Gen-8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
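
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and it assumes the server follows the OpenAI chat-completions response shape (`choices[0].message.content`) and listens on port 8000 as in the curl example above.

```python
import json
import urllib.request


def build_chat_request(model, prompt, image_url):
    """Build the same JSON body as the curl example (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to a running server and return the reply text.

    Assumes an OpenAI-compatible response body.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request(
    "VisionXLab/FIRM-Gen-8B",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
print(json.dumps(payload, indent=2))

# Uncomment once the server is running:
# print(send_chat_request(payload))
```

For the SGLang server described further down, the same sketch should work with `base_url="http://localhost:30000"`.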

Use Docker

docker model run hf.co/VisionXLab/FIRM-Gen-8B

SGLang

How to use VisionXLab/FIRM-Gen-8B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "VisionXLab/FIRM-Gen-8B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "VisionXLab/FIRM-Gen-8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "VisionXLab/FIRM-Gen-8B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "VisionXLab/FIRM-Gen-8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use VisionXLab/FIRM-Gen-8B with Docker Model Runner:
```
docker model run hf.co/VisionXLab/FIRM-Gen-8B
```

nielsr HF Staff committed on Mar 13

Commit 0de124d (verified) · 1 parent: 7e157e9

Improve model card: add paper link, metadata, and description

Hi! I'm Niels from the Hugging Face community team.

I've improved the model card for this checkpoint. Based on the associated research paper, this model is **FIRM-Gen-8B**, a robust reward model designed to act as a critic for faithful text-to-image generation within the FIRM (Faithful Image Reward Modeling) framework.

Changes include:
- Added the `image-text-to-text` pipeline tag to improve discoverability.
- Linked the model to the research paper, project page, and official GitHub repository.
- Provided a descriptive summary of the model's purpose and its role in reducing hallucinations during reinforcement learning.
- Maintained the existing training hyperparameters and results table.
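
As a rough illustration of the metadata change, the sketch below reads the top-level fields from a model card's YAML front matter and checks the pipeline tag. The function `read_front_matter` is a hypothetical helper, not part of any library. It handles only simple `key: value` lines and skips list items and nested keys, so a real YAML parser would be preferable for anything beyond a quick check.

```python
def read_front_matter(readme_text):
    """Return top-level `key: value` pairs from a leading '---' block.

    Hypothetical helper: skips list items and indented (nested) lines.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the front matter block
        if line.startswith((" ", "-")) or ":" not in line:
            continue  # list item, nested key, or non key-value line
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


# Sample text abbreviated from the updated README in this commit.
sample = """---
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
license: other
pipeline_tag: image-text-to-text
tags:
- llama-factory
- reward-model
---
# FIRM-Gen-8B (gen_reward_sft)
"""

meta = read_front_matter(sample)
print(meta["pipeline_tag"])  # image-text-to-text
```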

Files changed (1)

README.md +43 -38

@@ -1,57 +1,57 @@
 ---
 library_name: transformers
 license: other
-base_model: Qwen/Qwen3-VL-8B-Instruct
 tags:
 - llama-factory
-- full
 - generated_from_trainer
 model-index:
-- name: gen_reward_sft
   results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# gen_reward_sft
-This model is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) on the gen_reward_sft dataset.
-It achieves the following results on the evaluation set:
-- Loss: 0.5180
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
 The following hyperparameters were used during training:
-- learning_rate: 1e-05
-- train_batch_size: 5
-- eval_batch_size: 2
-- seed: 42
-- distributed_type: multi-GPU
-- num_devices: 8
-- gradient_accumulation_steps: 2
-- total_train_batch_size: 80
-- total_eval_batch_size: 16
-- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: cosine
-- lr_scheduler_warmup_ratio: 0.1
-- num_epochs: 1.0
-### Training results
 | Training Loss | Epoch  | Step | Validation Loss |
 |:-------------:|:------:|:----:|:---------------:|
@@ -63,10 +63,15 @@ The following hyperparameters were used during training:
 | 0.5155        | 0.8279 | 3000 | 0.5207          |
 | 0.5106        | 0.9659 | 3500 | 0.5181          |
-### Framework versions
-- Transformers 4.57.3
-- Pytorch 2.7.1+cu128
-- Datasets 4.0.0
-- Tokenizers 0.22.2

 ---
+base_model: Qwen/Qwen3-VL-8B-Instruct
 library_name: transformers
 license: other
+pipeline_tag: image-text-to-text
 tags:
 - llama-factory
+- reward-model
+- image-generation
+- reinforcement-learning
 - generated_from_trainer
 model-index:
+- name: FIRM-Gen-8B (gen_reward_sft)
   results: []
 ---
+# FIRM-Gen-8B (gen_reward_sft)
+This model is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) and serves as a robust reward model (critic) for text-to-image generation. It was introduced as part of the **FIRM (Faithful Image Reward Modeling)** framework in the paper "[Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation](https://huggingface.co/papers/2603.12247)".
+- **Paper:** [Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation](https://huggingface.co/papers/2603.12247)
+- **Project Page:** [firm-reward.github.io](https://firm-reward.github.io/)
+- **Repository:** [VisionXLab/FIRM-Reward](https://github.com/VisionXLab/FIRM-Reward)
+## Model Description
+FIRM-Gen-8B is specifically trained on the **FIRM-Gen-293K** dataset to provide accurate and reliable guidance for faithful image generation. It addresses the common issue of reward hacking and hallucinations in Multimodal Large Language Models (MLLMs) by using a "plan-then-score" pipeline to evaluate how well a generated image follows complex instructions.
+Within a Reinforcement Learning (RL) pipeline, this model acts as the critic, assigning scores that guide the optimization of generative models (like Stable Diffusion 3.5 or FLUX) toward better instruction adherence and visual fidelity.
+## Intended Uses & Limitations
+This model is intended to be used as a reward signal in RL pipelines or as an evaluation metric for text-to-image alignment. It is compatible with the `transformers` library and can be deployed using the reward server scripts found in the official repository.
+## Training Procedure
+### Training Hyperparameters
 The following hyperparameters were used during training:
+- **learning_rate:** 1e-05
+- **train_batch_size:** 5
+- **eval_batch_size:** 2
+- **seed:** 42
+- **distributed_type:** multi-GPU
+- **num_devices:** 8
+- **gradient_accumulation_steps:** 2
+- **total_train_batch_size:** 80
+- **total_eval_batch_size:** 16
+- **optimizer:** AdamW
+- **lr_scheduler_type:** cosine
+- **lr_scheduler_warmup_ratio:** 0.1
+- **num_epochs:** 1.0
+### Training Results
 | Training Loss | Epoch  | Step | Validation Loss |
 |:-------------:|:------:|:----:|:---------------:|
 | 0.5155        | 0.8279 | 3000 | 0.5207          |
 | 0.5106        | 0.9659 | 3500 | 0.5181          |
+## Citation
+If you find this model useful, please cite:
+```bibtex
+@article{zhao2025trust,
+  title={Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation},
+  author={Zhao, Xiangyu and Zhang, Peiyuan and Lin, Junming and Liang, Tianhao and Duan, Yuchen and Ding, Shengyuan and Tian, Changyao and Zang, Yuhang and Yan, Junchi and Yang, Xue},
+  journal={arXiv preprint arXiv:2603.12247},
+  year={2025}
+}
+```
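
The effective batch sizes in the hyperparameter list can be checked from the per-device values. The sketch below assumes the usual convention that the total training batch is the per-device batch times the number of devices times the gradient accumulation steps, and that evaluation does not use accumulation. The steps-per-epoch figure is only an estimate derived from the last row of the results table.

```python
# Values copied from the hyperparameter list in the model card.
train_batch_size = 5
eval_batch_size = 2
num_devices = 8
gradient_accumulation_steps = 2

# Effective batch sizes across all devices.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices

# Rough estimate from the results table: step 3500 was reached at epoch 0.9659.
steps_per_epoch = round(3500 / 0.9659)
approx_train_examples = steps_per_epoch * total_train_batch_size

print(total_train_batch_size, total_eval_batch_size)  # 80 16
print(steps_per_epoch, approx_train_examples)  # 3624 289920
```

Under the assumption that every optimizer step consumes a full batch, one epoch covers roughly 290K training examples. The card does not state the train/evaluation split, so this figure is an approximation only.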