---
license: apache-2.0
base_model:
- Qwen/Qwen3-VL-8B-Instruct
pipeline_tag: image-text-to-text
tags:
- reward-model
- text-to-image
- human-preference
- rlhf
---
# HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities
HPSv3++ is a **capability-aware and RL-iteration-aware** text-to-image (T2I) reward model, built on the `Qwen/Qwen3-VL-8B-Instruct` backbone with a Capability Encoder, a FiLM conditioning head, and a three-layer RankNet reward head.
The Capability Encoder implicitly infers the generative ability of the model that produced an image, while the RL iteration step is supplied as an explicit condition. Both signals jointly condition the model through FiLM, so that a single reward model produces calibrated preference scores across generators of differing capability and at different stages of RL optimization.
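For readers unfamiliar with the two building blocks named above, here is a minimal pure-Python sketch of FiLM (feature-wise linear modulation) and of the standard pairwise RankNet loss. It is an illustration under assumptions, not the HPSv3++ implementation: the function names `film` and `ranknet_loss` are hypothetical, and this card does not specify how `gamma` and `beta` are derived from the capability and RL-iteration conditions, what the layer shapes are, or whether the training loss is exactly this form.

```python
import math


def film(features, gamma, beta):
    """Apply FiLM: scale and shift each feature channel as gamma * h + beta.

    In a conditioned model, gamma and beta would be predicted from the
    conditioning signals; here they are plain lists (an assumption).
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma, and beta must have the same length")
    return [g * h + b for h, g, b in zip(features, gamma, beta)]


def ranknet_loss(score_preferred, score_rejected):
    """Pairwise RankNet loss: -log(sigmoid(s_preferred - s_rejected)).

    Computed as softplus(-diff) in a numerically stable way.
    """
    diff = score_preferred - score_rejected
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


if __name__ == "__main__":
    # Two feature channels, scaled and shifted independently.
    print(film([1.0, 2.0], [2.0, 0.5], [0.0, 1.0]))
    # The loss shrinks as the preferred image outscores the rejected one.
    print(ranknet_loss(3.0, 0.0), ranknet_loss(0.0, 3.0))
```

The loss is small when the preferred image receives the higher score and grows roughly linearly when the order is inverted, which is what drives a reward head toward consistent rankings.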
The training/evaluation dataset, HPDv3++, is released separately: [Junjun2333/HPDv3-PlusPlus](https://huggingface.co/datasets/Junjun2333/HPDv3-PlusPlus).
## Files
| File | Description |
|---|---|
| `hpsv3++.pth` | Final HPSv3++ reward-model weights (17.6 GB) |
| `config.json` | Model configuration |
## Conditioning at inference
- **Model capability** is inferred implicitly from the image; you do not pass it in.
- **RL iteration** is passed explicitly as a normalized scalar in `[0, 1]`.
  - General preference scoring / ranking: use `0.0` (pre-RL setting).
  - As the reward inside T2I RL fine-tuning: ramp the iteration condition linearly from `0.3` to `1.0` over training (the setting used in the paper).
- Use the mean (`mu`) output as the scalar reward.
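
The RL-iteration settings above can be sketched as a small schedule helper. This is a hedged example: the function name `rl_iteration_condition`, the constant `GENERAL_SCORING_CONDITION`, and the choice to index the ramp by optimizer step are assumptions for illustration; only the values `0.0`, `0.3`, and `1.0` and the linear ramp come from this card.

```python
# Iteration condition for plain preference scoring / ranking (pre-RL setting).
GENERAL_SCORING_CONDITION = 0.0


def rl_iteration_condition(step, total_steps, start=0.3, end=1.0):
    """Linearly ramp the iteration condition from `start` to `end`.

    `step` is the zero-based current training step and `total_steps` the
    planned number of steps (both hypothetical names). Values outside the
    training range are clamped to the endpoints.
    """
    last_step = max(total_steps - 1, 1)
    frac = min(max(step / last_step, 0.0), 1.0)
    # Interpolating this way returns the endpoints exactly at frac 0 and 1.
    return start * (1.0 - frac) + end * frac


if __name__ == "__main__":
    total = 1000
    for step in (0, 250, 500, 999):
        print(step, round(rl_iteration_condition(step, total), 4))
```

The returned scalar would be passed as the explicit RL-iteration input on each reward call, with the `mu` output taken as the reward.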
## Citation
```bibtex
@misc{hpsv3pp,
title = {HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities},
author = {HPSv3++ Team},
year = {2026}
}
```