8y committed · Commit 4820f0d · verified · 1 Parent(s): fbb87be

Update README.md

  1. README.md +102 -3
README.md CHANGED
@@ -1,3 +1,102 @@
- ---
- license: mit
- ---
+ # Model Card for HP (High-Preference) Model
+
+ This model is a specialized human preference scoring function that evaluates image quality purely on visual aesthetics and human preference, without relying on text-image alignment. See our paper [Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment]() for more details.
+
+ ## Model Details
+
+ ### Model Description
+
+ The HP (High-Preference) model evaluates image quality using the **image modality** alone. When text-image alignment saturates (the [ICT score](https://huggingface.co/8y/ICT) approaches 1), the HP model continues to differentiate image quality based on aesthetic and perceptual factors that matter to human viewers.
+
+ **Core Philosophy**: Once an image adequately represents the textual content, further quality improvements should be measured through purely visual assessment rather than text-image similarity metrics.
+
+ ### Key Features
+
+ - **Image-Only Evaluation**: No text input is required; the model focuses purely on visual quality
+ - **Human Preference Aligned**: Trained on preference triplets from the [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and the Pick-a-Pic dataset
+ - **Complementary Design**: Works best when combined with the [ICT model](https://huggingface.co/8y/ICT) for comprehensive evaluation
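
The complementary design above could be sketched as follows. This is only an illustration, not the authors' method: the function name `combined_score`, the choice of a simple product, and the example values are all assumptions, and both scores are assumed to lie in [0, 1].

```python
def combined_score(ict_score, hp_score):
    """Hypothetical combination: multiply the two scores, both assumed to lie in [0, 1]."""
    for name, value in (("ict_score", ict_score), ("hp_score", hp_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return ict_score * hp_score


# Made-up (ICT, HP) pairs for three candidate images.
candidates = {
    "a.png": (0.98, 0.35),  # well aligned with the prompt, weak aesthetics
    "b.png": (0.97, 0.80),  # well aligned with the prompt, strong aesthetics
    "c.png": (0.40, 0.90),  # poorly aligned, strong aesthetics
}

# Pick the candidate with the highest combined score.
best = max(candidates, key=lambda name: combined_score(*candidates[name]))
print(best)
```

With a product, an image needs both adequate alignment and high visual quality to rank first; other combinations, such as thresholding ICT before comparing HP, would be equally plausible.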
+
+ ### Model Sources
+
+ * **Repository:** [https://github.com/BarretBa/ICTHP](https://github.com/BarretBa/ICTHP)
+ * **Paper:** [Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment](https://arxiv.org/abs/xxxx.xxxxx)
+ * **Base Model:** CLIP-ViT-H-14 (Image Encoder + MLP Head)
+ * **Training Dataset:** [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and Pick-a-Pic dataset (360,000 preference triplets)
+
+ ## How to Get Started with the Model
+
+ ### Installation
+
+ ```bash
+ pip install torch transformers pillow numpy open-clip-torch
+ ```
+
+ ### Quick Start
+ ```python
+ # import
+ import os
+
+ import torch
+ import torch.nn as nn
+ from huggingface_hub import hf_hub_download
+ from PIL import Image
+ from transformers import CLIPModel, CLIPProcessor
+
+ class MLP(nn.Module):
+     """Scoring head that maps a 1024-d CLIP image embedding to a single logit."""
+     def __init__(self):
+         super().__init__()
+         self.layers = nn.Sequential(
+             nn.Linear(1024, 1024), nn.Dropout(0.2),
+             nn.Linear(1024, 128), nn.Dropout(0.2),
+             nn.Linear(128, 64), nn.Dropout(0.1),
+             nn.Linear(64, 16), nn.Linear(16, 1)
+         )
+
+     def forward(self, x):
+         return self.layers(x)
+
+ # load model
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ processor_name_or_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
+ model_pretrained_name_or_path = "8y/HP"
+
+ processor = CLIPProcessor.from_pretrained(processor_name_or_path)
+ backbone = CLIPModel.from_pretrained(model_pretrained_name_or_path, subfolder="hp_backbone").eval().to(device)
+
+ # torch.load needs a local file: use a local checkout if given, otherwise fetch from the Hub
+ scorer_file = "hp_scorer/mlp_pytorch_model.bin"
+ if os.path.isdir(model_pretrained_name_or_path):
+     scorer_path = os.path.join(model_pretrained_name_or_path, scorer_file)
+ else:
+     scorer_path = hf_hub_download(repo_id=model_pretrained_name_or_path, filename=scorer_file)
+
+ scorer = MLP()
+ scorer.load_state_dict(torch.load(scorer_path, map_location="cpu"))
+ scorer = scorer.eval().to(device)
+
+ def calc_hp_scores(images):
+     # preprocess
+     image_inputs = processor(
+         images=images,
+         return_tensors="pt"
+     ).to(device)
+
+     with torch.no_grad():
+         # extract features
+         image_features = backbone.get_image_features(**image_inputs)
+
+         # calculate hp scores
+         hp_scores = torch.sigmoid(scorer(image_features))
+
+     # squeeze only the last dimension so a single image still yields a list
+     return hp_scores.squeeze(-1).cpu().tolist()
+
+ pil_images = [Image.open("image1.jpg").convert("RGB"), Image.open("image2.jpg").convert("RGB")]
+ scores = calc_hp_scores(pil_images)
+ print(f"HP Scores: {scores}")
+ ```
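
A typical follow-up to the Quick Start is ranking a batch of candidate images by HP score. The sketch below is illustrative only: `rank_by_hp_score` is a hypothetical helper, and the scores are placeholders standing in for the output of `calc_hp_scores`.

```python
def rank_by_hp_score(paths, scores):
    """Return (path, score) pairs sorted from highest to lowest HP score."""
    if len(paths) != len(scores):
        raise ValueError("paths and scores must have the same length")
    return sorted(zip(paths, scores), key=lambda pair: pair[1], reverse=True)


# Placeholder values; in practice these would come from calc_hp_scores(pil_images).
paths = ["image1.jpg", "image2.jpg", "image3.jpg"]
scores = [0.41, 0.87, 0.63]

ranked = rank_by_hp_score(paths, scores)
for path, score in ranked:
    print(f"{score:.3f}  {path}")
```

Because the HP model takes no text input, the same ranking applies regardless of the prompt that produced the images.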
+
+ ## Training Details
+
+ ### Training Data
+
+ This model was trained on 360,000 preference triplets from the [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and the Pick-a-Pic dataset.
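
To make the idea of learning from preference triplets concrete, here is a minimal sketch of a generic Bradley-Terry style pairwise loss. It is not taken from the paper and the actual training objective may differ: the function names are hypothetical, each triplet is assumed to order three images from most to least preferred, and the inputs are assumed to be raw scorer outputs before the sigmoid.

```python
import math


def pairwise_preference_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected))."""
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def triplet_preference_loss(score_best, score_mid, score_worst):
    """Average the pairwise loss over the three ordered pairs in a ranked triplet (assumed layout)."""
    pairs = [
        (score_best, score_mid),
        (score_mid, score_worst),
        (score_best, score_worst),
    ]
    return sum(pairwise_preference_loss(p, r) for p, r in pairs) / len(pairs)


# A correctly ordered triplet gives a lower loss than a reversed one.
good = triplet_preference_loss(2.0, 0.5, -1.0)
bad = triplet_preference_loss(-1.0, 0.5, 2.0)
print(f"ordered: {good:.4f}, reversed: {bad:.4f}")
```

Minimizing such a loss pushes the scorer to assign higher values to images that annotators preferred.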
+ <!--
+
+ ## Citation
+
+ ```bibtex
+ @article{ba2024enhancing,
+   title={Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment},
+   author={Ba, Ying and Zhang, Tianyu and Bai, Yalong and Mo, Wenyi and Liang, Tao and Su, Bing and Wen, Ji-Rong},
+   journal={arXiv preprint arXiv:xxxx.xxxxx},
+   year={2024}
+ }
+ ``` -->