Improve model card: Add pipeline tag, library, paper, GitHub, and usage
#1 by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,3 +1,151 @@
---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

This repository contains the `Griffon v2` model, a unified high-resolution generalist model designed to enable flexible object referring with visual and textual prompts.

Griffon v2 was presented in the paper [Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring](https://huggingface.co/papers/2403.09333).

The abstract of the paper is as follows:

"Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. Such limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI Agents, counting, *etc*. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector to overcome the input tokens constraint in Large Language Models. This design inherently preserves the complete contexts and fine details and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts, and even coordinates. Experiments demonstrate that Griffon v2 can localize objects of interest with visual and textual referring, achieve state-of-the-art performance on REC and phrase grounding, and outperform expert models in object detection, object counting, and REG. Data and codes are released at this https URL."

The official code and data for the Griffon series (including Griffon v2) can be found on the [GitHub repository](https://github.com/jefferyZhan/Griffon).

## Quick Start

This section shows how to run inference with Griffon v2. The model accepts images of any size as input. Its outputs are normalized to relative coordinates in the 0-1000 range, as either a center point or a bounding box given by its top-left and bottom-right corners. For visualization, convert these relative coordinates back to the original image dimensions.
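The coordinate conversion described above can be sketched with a minimal helper, assuming the model returns a box as `(x1, y1, x2, y2)` in the normalized 0-1000 range (the helper name is illustrative, not part of the released code):

```python
def to_pixels(box, width, height):
    # Scale a box from the normalized 0-1000 range back to pixel coordinates.
    # `box` is (x1, y1, x2, y2); `width`/`height` are the original image size.
    x1, y1, x2, y2 = box
    return (x1 * width / 1000, y1 * height / 1000,
            x2 * width / 1000, y2 * height / 1000)

# Example: the center quarter of a 1920x1080 image.
print(to_pixels((250, 250, 750, 750), 1920, 1080))
# (480.0, 270.0, 1440.0, 810.0)
```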

First, install the `transformers` library and other necessary dependencies:

```bash
pip install transformers
```

For additional dependencies, please refer to the [InternVL2 documentation](https://internvl.readthedocs.io/en/latest/get_started/installation.html).

Here is an inference code example for a model like `OS-Atlas-Base-4B`, which is related to the Griffon v2 work:

```python
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# If you want to load a model using multiple GPUs, please refer to the `Multiple GPUs` section in the original GitHub repo.
# Replace 'OS-Copilot/OS-Atlas-Base-4B' with the actual model ID for Griffon v2 if different.
path = 'OS-Copilot/OS-Atlas-Base-4B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/images/web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = ("In the screenshot of this web page, please give me the coordinates of the element "
            "I want to click on according to my instructions(with point).\n\"'Champions League' link\"")
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```
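To visualize a `(with point)` answer, the predicted point must likewise be rescaled to the original image. A rough sketch, assuming the response contains a point such as `(512, 83)` in the 0-1000 range (this response format is a hypothetical example; the exact output depends on the checkpoint):

```python
import re

def parse_point(response, width, height):
    # Extract the first '(x, y)' pair from a response string and scale it
    # from the normalized 0-1000 range back to pixel coordinates.
    match = re.search(r"\((\d+),\s*(\d+)\)", response)
    if match is None:
        return None
    x, y = int(match.group(1)), int(match.group(2))
    return (x * width / 1000, y * height / 1000)

# Hypothetical response for a 1920x1080 screenshot:
print(parse_point("(512, 83)", 1920, 1080))
# (983.04, 89.64)
```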

## Citation

If you find Griffon useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{zhan2024griffonv2,
      title={Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring},
      author={Yufei Zhan and Yousong Zhu and Hongyin Zhao and Fan Yang and Ming Tang and Jinqiao Wang},
      year={2024},
      eprint={2403.09333},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## License

The data and checkpoints are licensed for research use only, and are further restricted to uses that comply with the license agreements of LLaVA, LLaMA, Gemma2, and GPT-4. The dataset is released under CC BY-NC 4.0 (non-commercial use only), and models trained on it should not be used outside of research purposes.