Add pipeline_tag, library_name and paper metadata

#1 opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +17 -10
README.md CHANGED
@@ -1,25 +1,32 @@
  ---
- license: mit
  base_model:
  - CodeGoat24/UnifiedReward-Think-qwen3vl-2b
  datasets:
  - CodeGoat24/UnifiedReward-Flex-SFT-90K
  ---

- # Model Summary
- **UnifiedReward-Flex-qwen3vl-2b** is a **unified personalized reward model for vision generation** that couples reward modeling with flexible and context-adaptive reasoning!!

- 🚀 The inference code is available at [Github](https://github.com/CodeGoat24/UnifiedReward/tree/main/UnifiedReward-Flex).

- For further details, please refer to the following resources:
- - 📰 Paper: https://arxiv.org/abs/2602.02380
- - 🪐 Project Page: https://codegoat24.github.io/UnifiedReward/flex
- - 🤗 Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-flex
- - 🤗 Dataset: https://huggingface.co/datasets/CodeGoat24/UnifiedReward-Flex-SFT-90K
- - 👋 Point of Contact: [Yibin Wang](https://codegoat24.github.io)

  ## Citation

  ---
  base_model:
  - CodeGoat24/UnifiedReward-Think-qwen3vl-2b
  datasets:
  - CodeGoat24/UnifiedReward-Flex-SFT-90K
+ license: mit
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  ---

+ # UnifiedReward-Flex-qwen3vl-2b
+
+ **UnifiedReward-Flex-qwen3vl-2b** is a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning.

+ The model was introduced in the paper [Unified Personalized Reward Model for Vision Generation](https://huggingface.co/papers/2602.02380).

+ ## Model Summary

+ UnifiedReward-Flex addresses the limitations of traditional "one-size-fits-all" reward models by dynamically constructing hierarchical assessments based on content-specific visual cues. It follows a two-stage training process:
+ 1. **SFT**: Distilling structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap supervised fine-tuning, equipping the model with flexible and context-adaptive reasoning.
+ 2. **DPO**: Performing Direct Preference Optimization on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment.
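
The card's `library_name: transformers` and `pipeline_tag: image-text-to-text` tags suggest the checkpoint can be queried through the generic transformers auto-classes. The sketch below builds a pairwise image-comparison request and decodes the model's reasoning; it is an illustration under those assumptions, not the repository's official inference code (see the GitHub link in the resources), and the prompt wording, `build_pairwise_messages`, and `judge` helpers are hypothetical.

```python
# Minimal pairwise-scoring sketch. Assumptions (not from the model card):
# the checkpoint loads via AutoProcessor / AutoModelForImageTextToText,
# and the comparison prompt wording is illustrative only.
MODEL_ID = "CodeGoat24/UnifiedReward-Flex-qwen3vl-2b"


def build_pairwise_messages(image_a: str, image_b: str, caption: str) -> list:
    """Build a chat-format request asking the reward model to compare two images."""
    question = (
        f"Given the caption '{caption}', which image is better? "
        "Reason step by step before giving a final judgement."
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_a},
                {"type": "image", "image": image_b},
                {"type": "text", "text": question},
            ],
        }
    ]


def judge(image_a: str, image_b: str, caption: str, max_new_tokens: int = 512) -> str:
    # Deferred import so the prompt builder above is usable without transformers.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, device_map="auto")
    inputs = processor.apply_chat_template(
        build_pairwise_messages(image_a, image_b, caption),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens and decode only the generated judgement.
    return processor.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```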
 
 
 
+ ## Resources

+ - **📰 Paper:** [Unified Personalized Reward Model for Vision Generation](https://huggingface.co/papers/2602.02380)
+ - **🪐 Project Page:** [https://codegoat24.github.io/UnifiedReward/flex](https://codegoat24.github.io/UnifiedReward/flex)
+ - **🚀 Code:** [GitHub Repository](https://github.com/CodeGoat24/UnifiedReward/tree/main/UnifiedReward-Flex)
+ - **🤗 Model Collections:** [UnifiedReward-Flex Collection](https://huggingface.co/collections/CodeGoat24/unifiedreward-flex)
+ - **🤗 Dataset:** [UnifiedReward-Flex-SFT-90K](https://huggingface.co/datasets/CodeGoat24/UnifiedReward-Flex-SFT-90K)

  ## Citation