# Model Card for HP (High-Preference) Model
This model is a specialized human preference scoring function that evaluates image quality based purely on visual aesthetics and human preferences, without relying on text-image alignment. See our paper [Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment](https://arxiv.org/abs/2507.19002) for more details.
## Model Details
### Model Description
The HP (High-Preference) model represents a paradigm shift in image quality evaluation by operating solely on the **image modality**. When text-image alignment reaches saturation (the [ICT score](https://huggingface.co/8y/ICT) approaches 1), the HP model continues to differentiate image quality based on the aesthetic and perceptual factors that matter to human viewers.
**Core Philosophy**: Once an image adequately represents its textual content, further quality improvements should be measured through pure visual assessment rather than text-image similarity metrics.
### Key Features
- **Image-Only Evaluation**: Requires no text input; focuses purely on visual quality
- **Human Preference Aligned**: Trained on preference triplets from the [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and the Pick-a-Pic dataset
- **Complementary Design**: Works optimally when combined with the [ICT model](https://huggingface.co/8y/ICT) for comprehensive evaluation
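To illustrate the complementary design, the sketch below gates a final reward on both signals: once text-image alignment (ICT) is saturated, the image-only HP score drives the ranking. The multiplicative combination and the example score values are illustrative assumptions for demonstration, not the paper's exact formulation.

```python
def combined_reward(ict_score: float, hp_score: float) -> float:
    """Illustrative combination of a text-alignment (ICT) score and an
    image-only preference (HP) score, both in (0, 1). The product form
    is an assumption for demonstration, not the paper's exact rule."""
    return ict_score * hp_score

# Two images with saturated alignment (ICT near 1) are still
# distinguished by their HP scores alone.
print(combined_reward(0.98, 0.40))  # aligned, but aesthetically weaker
print(combined_reward(0.98, 0.90))  # aligned and strongly preferred
```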
### Model Sources
* **Repository:** [https://github.com/BarretBa/ICTHP](https://github.com/BarretBa/ICTHP)
* **Paper:** [Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment](https://arxiv.org/abs/2507.19002)
* **Base Model:** CLIP-ViT-H-14 (image encoder + MLP head)
* **Training Dataset:** [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and the Pick-a-Pic dataset (360,000 preference triplets)
## How to Get Started with the Model
### Installation
```bash
pip install torch transformers pillow numpy huggingface_hub
```
### Quick Start
```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor
from huggingface_hub import hf_hub_download
from PIL import Image


# MLP head mapping 1024-d CLIP image features to a scalar preference score
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1024, 1024), nn.Dropout(0.2),
            nn.Linear(1024, 128), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.Dropout(0.1),
            nn.Linear(64, 16), nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.layers(x)


# load model
device = "cuda" if torch.cuda.is_available() else "cpu"
processor_name_or_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model_pretrained_name_or_path = "8y/HP"

processor = CLIPProcessor.from_pretrained(processor_name_or_path)
backbone = CLIPModel.from_pretrained(model_pretrained_name_or_path, subfolder="hp_backbone").eval().to(device)

# download the MLP scorer weights from the Hub, then load them
scorer_path = hf_hub_download(model_pretrained_name_or_path, "mlp_pytorch_model.bin", subfolder="hp_scorer")
scorer = MLP()
scorer.load_state_dict(torch.load(scorer_path, map_location=device))
scorer = scorer.eval().to(device)


def calc_hp_scores(images):
    # preprocess PIL images into pixel tensors
    image_inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        # extract CLIP image features
        image_features = backbone.get_image_features(**image_inputs)
        # map features to preference scores in (0, 1)
        hp_scores = torch.sigmoid(scorer(image_features))
    return hp_scores.cpu().squeeze().tolist()


pil_images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
scores = calc_hp_scores(pil_images)
print(f"HP Scores: {scores}")
```
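Because `calc_hp_scores` returns one score per image, a natural use is ranking a set of generated candidates best-first. The helper below is a small self-contained sketch; the file names and score values are placeholders standing in for real `calc_hp_scores` output.

```python
def rank_images_by_hp(paths, scores):
    """Pair each image path with its HP score and sort best-first.
    Works directly on the list returned by calc_hp_scores."""
    return sorted(zip(paths, scores), key=lambda pair: pair[1], reverse=True)

# Example with hypothetical scores:
paths = ["image1.jpg", "image2.jpg", "image3.jpg"]
scores = [0.41, 0.87, 0.63]
for path, score in rank_images_by_hp(paths, scores):
    print(f"{path}: {score:.2f}")
```

Note that with a single input image, `squeeze().tolist()` in `calc_hp_scores` returns a bare float rather than a list.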
## Training Details
### Training Data
This model was trained on 360,000 preference triplets from the [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and the Pick-a-Pic dataset.
## Citation
```bibtex
@misc{ba2025enhancingrewardmodelshighquality,
      title={Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment},
      author={Ying Ba and Tianyu Zhang and Yalong Bai and Wenyi Mo and Tao Liang and Bing Su and Ji-Rong Wen},
      year={2025},
      eprint={2507.19002},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.19002},
}
```