---
license: mit
---

# Model Card for ICT (Image-Contained-Text) Model

This model is a specialized text-image alignment evaluation function that quantifies the extent to which an image contains textual information, without penalizing high-quality images with rich visual details. See our paper [Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment]() for more details.

## Model Details

### Model Description

The ICT (Image-Contained-Text) model addresses a fundamental flaw in existing reward models: traditional metrics such as CLIP Score inappropriately assign low scores to images with rich details and high aesthetic value. ICT instead evaluates text-image alignment by how well an image **contains** the textual content of its prompt.

### Key Features

- **Threshold-Based Evaluation**: uses an adaptive threshold mechanism instead of direct similarity scoring
- **Human Preference Aligned**: trained on hierarchical preference triplets from the [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and the Pick-a-Pic dataset
- **Complementary Design**: works best when combined with the [HP model](https://huggingface.co/8y/HP) for comprehensive evaluation

### Model Sources

* **Repository:** [https://github.com/BarretBa/ICTHP](https://github.com/BarretBa/ICTHP)
* **Paper:** [Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment](https://arxiv.org/abs/xxxx.xxxxx)
* **Base Model:** CLIP-ViT-H-14 (fine-tuned with ICT objectives)
* **Training Dataset:** [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and Pick-a-Pic dataset (360,000 preference triplets)

## How to Get Started with the Model

### Installation

```bash
pip install torch transformers pillow numpy open-clip-torch hpsv2
```

### Quick Start

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
from huggingface_hub import hf_hub_download
from hpsv2.src.open_clip import get_tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
processor_name_or_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model_pretrained_name_or_path = "8y/ICT"

processor = CLIPProcessor.from_pretrained(processor_name_or_path)
preprocess_val = lambda img: processor(images=img, return_tensors="pt")["pixel_values"]

# Load the ICT checkpoint into a CLIP backbone
ict_model = CLIPModel.from_pretrained(processor_name_or_path)
checkpoint_path = hf_hub_download(repo_id=model_pretrained_name_or_path, filename="pytorch_model.bin")
state_dict = torch.load(checkpoint_path, map_location="cpu")
ict_model.load_state_dict(state_dict, strict=False)
ict_model = ict_model.to(device)
ict_model.eval()

tokenizer = get_tokenizer('ViT-H-14')

def calc_ict_scores(images, texts):
    # Preprocess images into pixel-value tensors
    image_inputs = [preprocess_val(image).to(device) for image in images]

    ict_scores = []
    with torch.no_grad():
        for image_input, text in zip(image_inputs, texts):
            # Extract and normalize image features
            image_features = ict_model.get_image_features(pixel_values=image_input)
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)

            # Extract and normalize text features
            text_input_ids = tokenizer(text).to(device)
            text_features = ict_model.get_text_features(text_input_ids)
            text_features = text_features / text_features.norm(dim=-1, keepdim=True)

            # Cosine similarity between text and image embeddings
            ict_score = text_features @ image_features.T
            ict_scores.append(ict_score.squeeze().cpu().item())

    return ict_scores

pil_images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
texts = ["prompt for image1", "prompt for image2"]
scores = calc_ict_scores(pil_images, texts)
print(f"ICT Scores: {scores}")
```

## Training Details

### Training Objective

**ICT Scoring Framework**: instead of using direct CLIP similarity, ICT employs a threshold-based mechanism:

```
C(I, P) = min(CLIP(I, P) / θ, 1)
```

where θ is the adaptive threshold.

ICT models are trained using hierarchical scoring with triplet rankings:

* **E₃ = 1**: high-quality images with refined prompts
* **E₂ = C(I₂, P_easy)**: medium-quality images
* **E₁ = min(C(I₁, P_easy), E₂)**: low-quality images
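
The threshold rule and the hierarchical targets above can be sketched in plain Python. This is a minimal illustration, not the released training code: the function names and the θ value are assumptions made for the example.

```python
def contained_score(clip_score: float, theta: float) -> float:
    """Threshold-based containment score: C(I, P) = min(CLIP(I, P) / theta, 1).

    Scores at or above the threshold theta saturate at 1, so an image that
    already fully contains the prompt is not penalized for extra detail.
    """
    return min(clip_score / theta, 1.0)

def hierarchical_targets(clip_mid: float, clip_low: float, theta: float):
    """Training targets for one (high, medium, low) preference triplet."""
    e3 = 1.0                                         # refined-prompt image: full containment
    e2 = contained_score(clip_mid, theta)            # medium-quality image
    e1 = min(contained_score(clip_low, theta), e2)   # low-quality image, capped by E2
    return e3, e2, e1

# Toy example with an assumed threshold θ = 0.3
print(hierarchical_targets(clip_mid=0.27, clip_low=0.33, theta=0.3))
```

Note how the cap in E₁ preserves the ranking E₃ ≥ E₂ ≥ E₁ even when the raw CLIP score of the low-quality image is higher than that of the medium-quality one.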

### Training Data

This model was trained on 360,000 preference triplets from the [Pick-High dataset](https://huggingface.co/datasets/8y/Pick-High-Dataset) and the Pick-a-Pic dataset.

<!--
## Citation

```bibtex
@article{ba2024enhancing,
  title={Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment},
  author={Ba, Ying and Zhang, Tianyu and Bai, Yalong and Mo, Wenyi and Liang, Tao and Su, Bing and Wen, Ji-Rong},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2024}
}
```
-->