Improve model card: add paper link, metadata, and description
#1 opened by nielsr (HF Staff)
README.md CHANGED
@@ -1,57 +1,57 @@
  ---
  library_name: transformers
  license: other
-
  tags:
  - llama-factory
- -
  - generated_from_trainer
  model-index:
- - name: gen_reward_sft
    results: []
  ---

-
- should probably proofread and complete it, then remove this comment. -->
-
- # gen_reward_sft

- This model is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
- It achieves the following results on the evaluation set:
- - Loss: 0.5180

-

-

-

-

- ##

-

- ## Training

- ### Training

  The following hyperparameters were used during training:
- - learning_rate: 1e-05
- - train_batch_size: 5
- - eval_batch_size: 2
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 8
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 80
- - total_eval_batch_size: 16
- - optimizer:
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 1.0
-
- ### Training

  | Training Loss | Epoch | Step | Validation Loss |
  |:-------------:|:------:|:----:|:---------------:|
@@ -63,10 +63,15 @@ The following hyperparameters were used during training:
  | 0.5155 | 0.8279 | 3000 | 0.5207 |
  | 0.5106 | 0.9659 | 3500 | 0.5181 |


-

-
-
-
-
@@ -1,57 +1,57 @@
  ---
+ base_model: Qwen/Qwen3-VL-8B-Instruct
  library_name: transformers
  license: other
+ pipeline_tag: image-text-to-text
  tags:
  - llama-factory
+ - reward-model
+ - image-generation
+ - reinforcement-learning
  - generated_from_trainer
  model-index:
+ - name: FIRM-Gen-8B (gen_reward_sft)
    results: []
  ---

+ # FIRM-Gen-8B (gen_reward_sft)

+ This model is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) and serves as a robust reward model (critic) for text-to-image generation. It was introduced as part of the **FIRM (Faithful Image Reward Modeling)** framework in the paper "[Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation](https://huggingface.co/papers/2603.12247)".

+ - **Paper:** [Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation](https://huggingface.co/papers/2603.12247)
+ - **Project Page:** [firm-reward.github.io](https://firm-reward.github.io/)
+ - **Repository:** [VisionXLab/FIRM-Reward](https://github.com/VisionXLab/FIRM-Reward)

+ ## Model Description

+ FIRM-Gen-8B is specifically trained on the **FIRM-Gen-293K** dataset to provide accurate and reliable guidance for faithful image generation. It addresses the common issue of reward hacking and hallucinations in Multimodal Large Language Models (MLLMs) by using a "plan-then-score" pipeline to evaluate how well a generated image follows complex instructions.

+ Within a Reinforcement Learning (RL) pipeline, this model acts as the critic, assigning scores that guide the optimization of generative models (like Stable Diffusion 3.5 or FLUX) toward better instruction adherence and visual fidelity.

+ ## Intended Uses & Limitations

+ This model is intended to be used as a reward signal in RL pipelines or as an evaluation metric for text-to-image alignment. It is compatible with the `transformers` library and can be deployed using the reward server scripts found in the official repository.

+ ## Training Procedure

+ ### Training Hyperparameters

  The following hyperparameters were used during training:
+ - **learning_rate:** 1e-05
+ - **train_batch_size:** 5
+ - **eval_batch_size:** 2
+ - **seed:** 42
+ - **distributed_type:** multi-GPU
+ - **num_devices:** 8
+ - **gradient_accumulation_steps:** 2
+ - **total_train_batch_size:** 80
+ - **total_eval_batch_size:** 16
+ - **optimizer:** AdamW
+ - **lr_scheduler_type:** cosine
+ - **lr_scheduler_warmup_ratio:** 0.1
+ - **num_epochs:** 1.0
+
+ ### Training Results

  | Training Loss | Epoch | Step | Validation Loss |
  |:-------------:|:------:|:----:|:---------------:|
@@ -63,10 +63,15 @@ The following hyperparameters were used during training:
  | 0.5155 | 0.8279 | 3000 | 0.5207 |
  | 0.5106 | 0.9659 | 3500 | 0.5181 |

+ ## Citation

+ If you find this model useful, please cite:

+ ```bibtex
+ @article{zhao2025trust,
+ title={Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation},
+ author={Zhao, Xiangyu and Zhang, Peiyuan and Lin, Junming and Liang, Tianhao and Duan, Yuchen and Ding, Shengyuan and Tian, Changyao and Zang, Yuhang and Yan, Junchi and Yang, Xue},
+ journal={arXiv preprint arXiv:2603.12247},
+ year={2025}
+ }
+ ```
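
As a quick consistency check on the hyperparameters in the diff above: the reported totals follow from the per-device values (5 per device × 8 devices × 2 accumulation steps = 80 for training, 2 × 8 = 16 for evaluation). The Python sketch below reproduces that arithmetic and illustrates the shape of a cosine schedule with a 0.1 warmup ratio. The function names are hypothetical, the total step count is only an estimate extrapolated from the last logged row (step 3500 at epoch 0.9659), and the schedule is a generic linear-warmup plus half-cosine-decay formula, not necessarily the exact one used in training.

```python
import math

# Values copied from the hyperparameter list in the diff above.
PER_DEVICE_TRAIN_BATCH = 5
PER_DEVICE_EVAL_BATCH = 2
NUM_DEVICES = 8
GRAD_ACCUM_STEPS = 2
LEARNING_RATE = 1e-5
WARMUP_RATIO = 0.1

# ASSUMPTION: the total number of optimizer steps is not stated in the card.
# It is extrapolated here from the last logged row (step 3500 at epoch 0.9659).
ESTIMATED_TOTAL_STEPS = round(3500 / 0.9659)


def effective_train_batch(per_device: int, devices: int, accum: int) -> int:
    """Number of samples that contribute to one optimizer step."""
    return per_device * devices * accum


def effective_eval_batch(per_device: int, devices: int) -> int:
    """Number of samples evaluated per step across all devices."""
    return per_device * devices


def cosine_lr(step: int, total_steps: int, base_lr: float, warmup_ratio: float) -> float:
    """Generic schedule: linear warmup, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


if __name__ == "__main__":
    print("train batch per optimizer step:",
          effective_train_batch(PER_DEVICE_TRAIN_BATCH, NUM_DEVICES, GRAD_ACCUM_STEPS))
    print("eval batch per step:",
          effective_eval_batch(PER_DEVICE_EVAL_BATCH, NUM_DEVICES))
    print("estimated total steps:", ESTIMATED_TOTAL_STEPS)
    for step in (0, 1000, 2000, 3000, 3500):
        lr = cosine_lr(step, ESTIMATED_TOTAL_STEPS, LEARNING_RATE, WARMUP_RATIO)
        print(f"step {step}: lr = {lr:.3e}")
```

If the exact step count is known from the training logs, it can be substituted for `ESTIMATED_TOTAL_STEPS`; the batch-size arithmetic does not depend on it.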