Improve model card: Add pipeline tag, library, paper, GitHub, and usage

#1
by nielsr (HF Staff) - opened
Files changed (1): README.md (+148 -0)
README.md CHANGED
---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

This repository contains the `Griffon v2` model, a unified high-resolution generalist model designed to enable flexible object referring with visual and textual prompts.

Griffon v2 was presented in the paper [Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring](https://huggingface.co/papers/2403.09333).

The abstract of the paper is as follows:

"Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. Such limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI Agents, counting, etc. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector to overcome the input token constraint in Large Language Models. This design inherently preserves the complete contexts and fine details and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts, and even coordinates. Experiments demonstrate that Griffon v2 can localize objects of interest with visual and textual referring, achieve state-of-the-art performance on REC and phrase grounding, and outperform expert models in object detection, object counting, and REG. Data and codes are released at https://github.com/jefferyZhan/Griffon."

The official code and data for the Griffon series (including Griffon v2) can be found in the [GitHub repository](https://github.com/jefferyZhan/Griffon).

## Quick Start

This section provides instructions on how to use the Griffon v2 model for inference. The model accepts images of any size as input. Its outputs are normalized to relative coordinates within a 0-1000 range (either a center point or a bounding box defined by its top-left and bottom-right corners). For visualization, remember to convert these relative coordinates back to the original image dimensions.
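Since predictions come back on the 0-1000 relative scale, converting them to pixel coordinates is a single rescaling step. Here is a minimal sketch (the helper name `to_pixel_box` is illustrative, not part of the released API):

```python
def to_pixel_box(box_1000, image_width, image_height):
    """Convert a [x1, y1, x2, y2] box on the 0-1000 relative scale to pixels."""
    x1, y1, x2, y2 = box_1000
    return (
        round(x1 / 1000 * image_width),
        round(y1 / 1000 * image_height),
        round(x2 / 1000 * image_width),
        round(y2 / 1000 * image_height),
    )

# e.g. a box predicted on a 1920x1080 screenshot
print(to_pixel_box([250, 500, 750, 900], 1920, 1080))  # (480, 540, 1440, 972)
```

The same rescaling applies to a predicted center point, using only the first two values.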

First, install the `transformers` library and other necessary dependencies:

```bash
pip install transformers
```

For additional dependencies, please refer to the [InternVL2 documentation](https://internvl.readthedocs.io/en/latest/get_started/installation.html).

Here is an inference code example for a model like `OS-Atlas-Base-4B`, which is related to the Griffon v2 work:

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate the candidate tiling grids (i columns x j rows)
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image and split it into image_size x image_size tiles
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(img) for img in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# To load the model across multiple GPUs, refer to the `Multiple GPUs` section in the original GitHub repo.
# Replace 'OS-Copilot/OS-Atlas-Base-4B' with the actual model ID for Griffon v2 if different.
path = 'OS-Copilot/OS-Atlas-Base-4B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles with `max_num`
pixel_values = load_image('./examples/images/web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = ("In the screenshot of this web page, please give me the coordinates of the element "
            "I want to click on according to my instructions (with point).\n\"'Champions League' link\"")
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```

## Citation

If you find Griffon useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{zhan2024griffonv2,
      title={Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring},
      author={Yufei Zhan and Yousong Zhu and Hongyin Zhao and Fan Yang and Ming Tang and Jinqiao Wang},
      year={2024},
      eprint={2403.09333},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## License

The data and checkpoints are licensed for research use only. All of them are also restricted to uses that follow the license agreements of LLaVA, LLaMA, Gemma2, and GPT-4. The dataset is released under CC BY-NC 4.0 (non-commercial use only), and models trained on the dataset must not be used outside of research purposes.