---
license: mit
pipeline_tag: image-feature-extraction
library_name: transformers
---

# Model Card for HP (High-Preference) Model

This model is a specialized human preference scoring function that evaluates image quality based purely on visual aesthetics and human preferences, without relying on text-image alignment. See our paper [Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment](https://arxiv.org/abs/2507.19002) for more details.

## Model Details

### Model Description

The HP (High-Preference) model departs from standard text-conditioned reward models by operating solely on the **image modality**. When text-image alignment reaches saturation (the [ICT score](https://huggingface.co/8y/ICT) approaches 1), the HP model continues to differentiate image quality based on the aesthetic and perceptual factors that matter to human viewers.
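
To make the intended division of labor concrete, here is a minimal, hypothetical selection rule; the threshold value and the tie-breaking logic are illustrative assumptions, not a procedure from the paper:

```python
# Illustrative sketch: `ict_threshold` and the decision rule below are
# assumptions for exposition, not the paper's procedure.
def pick_better(ict_a: float, hp_a: float, ict_b: float, hp_b: float,
                ict_threshold: float = 0.95) -> str:
    if ict_a >= ict_threshold and ict_b >= ict_threshold:
        # Alignment is saturated for both candidates: rank by HP alone.
        return "a" if hp_a >= hp_b else "b"
    # Otherwise alignment still differentiates the candidates.
    return "a" if ict_a >= ict_b else "b"
```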

**Core Philosophy**: Once an image adequately represents textual content, further quality improvements should be measured through pure visual assessment rather than text-image similarity metrics.

### Key Features

- **Image-Only Evaluation**: No text input required; focuses purely on visual quality
- **Human Preference Aligned**: Trained on preference triplets from the [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and the Pick-a-Pic dataset
- **Complementary Design**: Works best when combined with the [ICT model](https://huggingface.co/8y/ICT) for comprehensive evaluation (a combined-scoring sketch follows the Quick Start example below)

### Model Sources

* **Repository:** [https://github.com/BarretBa/ICTHP](https://github.com/BarretBa/ICTHP)
* **Paper:** [Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment](https://arxiv.org/abs/2507.19002)
* **Base Model:** CLIP-ViT-H-14 (image encoder + MLP head)
* **Training Dataset:** [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and Pick-a-Pic dataset (360,000 preference triplets)

## How to Get Started with the Model

### Installation

```bash
pip install torch transformers pillow numpy open-clip-torch huggingface_hub
```

### Quick Start

```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


class MLP(nn.Module):
    """Scoring head that maps a 1024-d CLIP image embedding to a scalar."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1024, 1024), nn.Dropout(0.2),
            nn.Linear(1024, 128), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.Dropout(0.1),
            nn.Linear(64, 16), nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.layers(x)


# load model
device = "cuda"
processor_name_or_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model_pretrained_name_or_path = "8y/HP"

processor = CLIPProcessor.from_pretrained(processor_name_or_path)
backbone = CLIPModel.from_pretrained(model_pretrained_name_or_path, subfolder="hp_backbone").eval().to(device)

# download the MLP head weights from the Hub, then load them
scorer_path = hf_hub_download(repo_id=model_pretrained_name_or_path, filename="hp_scorer/mlp_pytorch_model.bin")
scorer = MLP()
scorer.load_state_dict(torch.load(scorer_path, map_location=device))
scorer = scorer.eval().to(device)


def calc_hp_scores(images):
    # preprocess the PIL images into pixel tensors
    image_inputs = processor(images=images, return_tensors="pt").to(device)

    with torch.no_grad():
        # extract 1024-d image embeddings from the CLIP backbone
        image_features = backbone.get_image_features(**image_inputs)

        # map embeddings to scalar scores, squashed into (0, 1)
        hp_scores = torch.sigmoid(scorer(image_features))

    return hp_scores.cpu().squeeze().tolist()


pil_images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
scores = calc_hp_scores(pil_images)
print(f"HP Scores: {scores}")
```
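
HP is designed to be used together with the text-conditioned [ICT model](https://huggingface.co/8y/ICT). The snippet below is a minimal sketch of combining the two signals: it assumes a `calc_ict_scores(prompt, images)` helper built analogously from the ICT model card, and the multiplicative combination is an illustrative choice; see the paper for the exact formulation.

```python
# Hypothetical: `calc_ict_scores` is assumed to be implemented from the
# ICT model card, analogously to `calc_hp_scores` above. The product
# combination is illustrative; see the paper for the exact rule.
def calc_combined_scores(prompt, images):
    hp_scores = calc_hp_scores(images)
    ict_scores = calc_ict_scores(prompt, images)
    return [ict * hp for ict, hp in zip(ict_scores, hp_scores)]
```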

## Training Details

### Training Data

This model was trained on 360,000 preference triplets from the [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and the Pick-a-Pic dataset.
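
For intuition, preference-based scorers of this kind are typically trained with a ranking objective that pushes the score of a preferred image above that of a rejected one. The sketch below shows a generic pairwise Bradley-Terry style loss as one such objective; it is illustrative and not necessarily the exact loss used in the paper.

```python
import torch.nn.functional as F

def pairwise_preference_loss(preferred_scores, rejected_scores):
    # Generic Bradley-Terry objective: maximize the score margin between
    # preferred and rejected images. Illustrative only; the paper's
    # actual training objective may differ.
    return -F.logsigmoid(preferred_scores - rejected_scores).mean()
```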

## Citation

```bibtex
@misc{ba2025enhancingrewardmodelshighquality,
      title={Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment},
      author={Ying Ba and Tianyu Zhang and Yalong Bai and Wenyi Mo and Tao Liang and Bing Su and Ji-Rong Wen},
      year={2025},
      eprint={2507.19002},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.19002},
}
```