# Model Card for HP (High-Preference) Model
This model is a specialized human preference scoring function that evaluates image quality based purely on visual aesthetics and human preferences, without relying on text-image alignment. See our paper [Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment](https://arxiv.org/abs/2507.19002) for more details.
## Model Details
### Model Description
The HP (High-Preference) model represents a paradigm shift in image quality evaluation by operating solely on the **image modality**. When text-image alignment reaches saturation (the [ICT score](https://huggingface.co/8y/ICT) approaches 1), the HP model continues to differentiate image quality based on the aesthetic and perceptual factors that matter to human viewers.
**Core Philosophy**: Once an image adequately represents its textual content, further quality improvements should be measured through pure visual assessment rather than text-image similarity metrics.
### Key Features
- **Image-Only Evaluation**: Requires no text input; focuses purely on visual quality
- **Human Preference Aligned**: Trained on preference triplets from the [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and the Pick-a-Pic dataset
- **Complementary Design**: Works optimally when combined with the [ICT model](https://huggingface.co/8y/ICT) for comprehensive evaluation
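To illustrate the complementary design, the sketch below gates a final reward on both signals: once text-image alignment (ICT) is saturated, the image-only HP score drives the ranking. The multiplicative combination and the example score values are illustrative assumptions for demonstration, not the paper's exact formulation.

```python
def combined_reward(ict_score: float, hp_score: float) -> float:
    """Illustrative combination of a text-alignment (ICT) score and an
    image-only preference (HP) score, both in (0, 1). The product form
    is an assumption for demonstration, not the paper's exact rule."""
    return ict_score * hp_score

# Two images with saturated alignment (ICT near 1) are still
# distinguished by their HP scores alone.
print(combined_reward(0.98, 0.40))  # aligned, but aesthetically weaker
print(combined_reward(0.98, 0.90))  # aligned and strongly preferred
```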
### Model Sources
* **Repository:** [https://github.com/BarretBa/ICTHP](https://github.com/BarretBa/ICTHP)
* **Paper:** [Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment](https://arxiv.org/abs/2507.19002)
* **Base Model:** CLIP-ViT-H-14 (image encoder + MLP head)
* **Training Dataset:** [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and the Pick-a-Pic dataset (360,000 preference triplets)
## How to Get Started with the Model
### Installation
```bash
pip install torch transformers pillow numpy huggingface_hub
```
### Quick Start
```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor
from huggingface_hub import hf_hub_download
from PIL import Image


# MLP head mapping 1024-d CLIP image features to a scalar preference score
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1024, 1024), nn.Dropout(0.2),
            nn.Linear(1024, 128), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.Dropout(0.1),
            nn.Linear(64, 16), nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.layers(x)


# load model
device = "cuda" if torch.cuda.is_available() else "cpu"
processor_name_or_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model_pretrained_name_or_path = "8y/HP"

processor = CLIPProcessor.from_pretrained(processor_name_or_path)
backbone = CLIPModel.from_pretrained(model_pretrained_name_or_path, subfolder="hp_backbone").eval().to(device)

# download the MLP scorer weights from the Hub, then load them
scorer_path = hf_hub_download(model_pretrained_name_or_path, "mlp_pytorch_model.bin", subfolder="hp_scorer")
scorer = MLP()
scorer.load_state_dict(torch.load(scorer_path, map_location=device))
scorer = scorer.eval().to(device)


def calc_hp_scores(images):
    # preprocess PIL images into pixel tensors
    image_inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        # extract CLIP image features
        image_features = backbone.get_image_features(**image_inputs)
        # map features to preference scores in (0, 1)
        hp_scores = torch.sigmoid(scorer(image_features))
    return hp_scores.cpu().squeeze().tolist()


pil_images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
scores = calc_hp_scores(pil_images)
print(f"HP Scores: {scores}")
```
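Because `calc_hp_scores` returns one score per image, a natural use is ranking a set of generated candidates best-first. The helper below is a small self-contained sketch; the file names and score values are placeholders standing in for real `calc_hp_scores` output.

```python
def rank_images_by_hp(paths, scores):
    """Pair each image path with its HP score and sort best-first.
    Works directly on the list returned by calc_hp_scores."""
    return sorted(zip(paths, scores), key=lambda pair: pair[1], reverse=True)

# Example with hypothetical scores:
paths = ["image1.jpg", "image2.jpg", "image3.jpg"]
scores = [0.41, 0.87, 0.63]
for path, score in rank_images_by_hp(paths, scores):
    print(f"{path}: {score:.2f}")
```

Note that with a single input image, `squeeze().tolist()` in `calc_hp_scores` returns a bare float rather than a list.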
## Training Details
### Training Data
This model was trained on 360,000 preference triplets from the [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and the Pick-a-Pic dataset.
## Citation
```bibtex
@misc{ba2025enhancingrewardmodelshighquality,
      title={Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment},
      author={Ying Ba and Tianyu Zhang and Yalong Bai and Wenyi Mo and Tao Liang and Bing Su and Ji-Rong Wen},
      year={2025},
      eprint={2507.19002},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.19002},
}
```