nielsr (HF Staff) committed
Commit fd04209 · verified · 1 Parent(s): 0e7ffe6

Improve model card for FIRM-Edit-8B


Hi! I'm Niels, part of the community science team at Hugging Face.

I've updated your model card to document this artifact as **FIRM-Edit-8B**, the robust reward model for image editing introduced in your recent paper.

Changes include:
- Added `pipeline_tag: image-text-to-text` to the metadata for better discoverability.
- Linked the model to the arXiv paper, project page, and GitHub repository.
- Added a model description explaining its role as a critic in the FIRM framework.
- Included the citation for your research.
- Retained your original training logs and hyperparameter information.

This will help researchers and practitioners find and use your reward model for reinforcement learning and evaluation in image editing.

Files changed (1)

  1. README.md +38 -26
README.md CHANGED

````diff
@@ -1,39 +1,42 @@
 ---
+base_model: Qwen/Qwen3-VL-8B-Instruct
 library_name: transformers
 license: other
-base_model: Qwen/Qwen3-VL-8B-Instruct
+pipeline_tag: image-text-to-text
 tags:
+- reward-model
+- image-editing
+- FIRM
 - llama-factory
-- full
 - generated_from_trainer
 model-index:
-- name: edit_evaluation_sft_202602030104
+- name: FIRM-Edit-8B
   results: []
 ---
 
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-
-# edit_evaluation_sft_202602030104
-
-This model is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) on the instruction_following_train_v3 and the consistency_train_v3 datasets.
-It achieves the following results on the evaluation set:
-- Loss: 0.5041
+# FIRM-Edit-8B
 
-## Model description
+[**Project Page**](https://firm-reward.github.io/) | [**Paper**](https://arxiv.org/abs/2603.12247) | [**GitHub**](https://github.com/VisionXLab/FIRM-Reward)
 
-More information needed
+**FIRM-Edit-8B** is a robust reward model (critic) designed for faithful image editing. It is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) on the **FIRM-Edit-370K** dataset. The model is part of the **FIRM (Faithful Image Reward Modeling)** framework, which provides accurate and reliable guidance for visual reinforcement learning pipelines.
 
-## Intended uses & limitations
+## Model Description
 
-More information needed
+Conventional reward models used for image editing often suffer from hallucinations and assign noisy scores, misguiding the optimization process. FIRM-Edit-8B addresses these issues by evaluating edits through two competing objectives:
+1. **Execution**: Adherence to the editing instruction.
+2. **Consistency**: Preservation of original content in unedited regions.
 
-## Training and evaluation data
+By formulating a "Consistency-Modulated Execution" (CME) reward strategy, this model acts as a stable critic that mitigates hallucinations and helps establish a new standard for fidelity in image editing.
 
-More information needed
+## Intended Uses & Limitations
+
+- **Reward Modeling**: To be used as a reward signal in Reinforcement Learning (RL) pipelines for image editing.
+- **Evaluation**: To serve as a metric for benchmarking the performance of image editing models.
 
 ## Training procedure
 
+The model was fine-tuned using the [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) framework.
+
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
@@ -41,12 +44,7 @@ The following hyperparameters were used during training:
 - train_batch_size: 10
 - eval_batch_size: 2
 - seed: 42
-- distributed_type: multi-GPU
-- num_devices: 8
 - gradient_accumulation_steps: 2
-- total_train_batch_size: 160
-- total_eval_batch_size: 16
-- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
 - lr_scheduler_type: cosine
 - lr_scheduler_warmup_ratio: 0.1
 - num_epochs: 1.0
@@ -60,10 +58,24 @@ The following hyperparameters were used during training:
 | 0.5252 | 0.6546 | 1500 | 0.5199 |
 | 0.5075 | 0.8728 | 2000 | 0.5055 |
 
-### Framework versions
-
-- Transformers 4.57.3
-- Pytorch 2.7.1+cu128
-- Datasets 4.0.0
-- Tokenizers 0.22.2
+## Usage
+
+To use the model as a reward server for RL training, you can use the script provided in the official repository:
+
+```bash
+# Launch the reward server
+python editing/reward_server/reward_server_qwen3_vl_8b_sft.py
+```
+
+## Citation
+
+If you find this work useful, please cite:
+
+```bibtex
+@article{zhao2026trust,
+  title={Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation},
+  author={Zhao, Xiangyu and Zhang, Peiyuan and Lin, Junming and Liang, Tianhao and Duan, Yuchen and Ding, Shengyuan and Tian, Changyao and Zang, Yuhang and Yan, Junchi and Yang, Xue},
+  journal={arXiv preprint arXiv:2603.12247},
+  year={2026}
+}
+```
````
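
As an aside for readers of the card: the "Consistency-Modulated Execution" (CME) idea described above — an execution score gated by a consistency score — can be illustrated with a small sketch. The actual CME formulation is defined in the paper; the function name, the multiplicative gating form, and the `gamma` exponent below are assumptions chosen purely for illustration.

```python
def cme_reward(execution: float, consistency: float, gamma: float = 2.0) -> float:
    """Illustrative consistency-modulated execution reward (NOT the paper's formula).

    `execution` scores adherence to the edit instruction; `consistency` scores
    preservation of unedited regions (both assumed in [0, 1]). Gating execution
    by consistency drives the reward toward zero when the edit destroys content
    that should have been preserved, no matter how well the instruction was followed.
    """
    return execution * (consistency ** gamma)


# A faithful edit retains most of the execution reward,
# while an edit that wrecks unedited regions is penalized heavily.
good_edit = cme_reward(0.9, 0.95)
bad_edit = cme_reward(0.9, 0.2)
```

The point of the gating shape is that a high execution score alone cannot compensate for poor consistency, which is exactly the failure mode (reward hacking via unfaithful edits) the model description says CME is meant to prevent.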