nielsr (HF Staff) committed
Commit c856e4f · verified · 1 Parent(s): 4d839c0

Add comprehensive model card for Osprey


This PR adds a comprehensive model card for the Osprey model, significantly improving its documentation on the Hugging Face Hub.

Key improvements include:
- Linking the model to its official paper: [Osprey: Pixel Understanding with Visual Instruction Tuning](https://huggingface.co/papers/2312.10032).
- Including the paper's abstract for quick understanding.
- Adding `pipeline_tag: image-text-to-text` to enable discoverability on the Hub.
- Specifying `library_name: transformers` based on the `LlavaLlamaForCausalLM` architecture found in `config.json`, integrating it with the Hugging Face `transformers` library ecosystem.
- Including a link to the official GitHub repository for code access and further details.
- Incorporating a detailed introduction, core features, and the complete "Try Our Demo" section (online and offline demo setup) directly from the original GitHub repository to provide robust usage instructions.
- All relevant sections from the GitHub README have been adapted to the model card for a holistic view.

Please review and merge this PR to enhance the model's documentation on the Hugging Face Hub.

Files changed (1)
README.md ADDED (+178, -0)
---
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- vision-language
- llava
- osprey
---

<p align="center" width="100%">
<img src="https://github.com/CircleRadon/Osprey/raw/main/assets/osprey.png" width="90%">
</p>

# Osprey: Pixel Understanding with Visual Instruction Tuning

This repository contains the Osprey model, presented in the paper [Osprey: Pixel Understanding with Visual Instruction Tuning](https://huggingface.co/papers/2312.10032).

<div align=center>

![Static Badge](https://img.shields.io/badge/Osprey-v1-F7C97E) [![arXiv preprint](https://img.shields.io/badge/arxiv-2312.10032-ECA8A7?logo=arxiv)](https://arxiv.org/pdf/2312.10032.pdf) [![Dataset](https://img.shields.io/badge/Dataset-Hugging_Face-CFAFD4)](https://huggingface.co/datasets/AntGroup-MI/Osprey-724K) [![video](https://img.shields.io/badge/Watch_Video-36600E?logo=youtube&logoColor=green)](https://youtu.be/YsxqHBBnDfk) [![Static Badge](https://img.shields.io/badge/Try_Demo-6B88E3?logo=youtubegaming&logoColor=DAE4EE)](http://111.0.123.204:8000/)
</div>

**Paper**: [https://huggingface.co/papers/2312.10032](https://huggingface.co/papers/2312.10032)

**GitHub Repository**: [https://github.com/CircleRadon/Osprey](https://github.com/CircleRadon/Osprey)

## Abstract
Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding, falling short in achieving fine-grained vision-language alignment at pixel level. Besides, the lack of mask-based instruction data limits their advancements. In this paper, we propose Osprey, a mask-text instruction tuning approach, to extend MLLMs by incorporating fine-grained mask regions into language instruction, aiming at achieving pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-language model by injecting pixel-level representation into LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high resolution input. Experimental results demonstrate Osprey's superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be integrated with Segment Anything Model (SAM) seamlessly to obtain multi-granularity semantics.

## What is Osprey 👀
Osprey is a mask-text instruction tuning approach that extends MLLMs by incorporating pixel-wise mask regions into language instructions, enabling **fine-grained visual understanding**. Given an input mask region, Osprey generates semantic descriptions at two levels of detail: a **short description** and a **detailed description**.

Osprey integrates seamlessly with [SAM](https://github.com/facebookresearch/segment-anything) in point-prompt, box-prompt and segment-everything modes to generate the semantics associated with specific parts or objects.

<img src="https://github.com/CircleRadon/Osprey/raw/main/assets/framework.png" width="800px">
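The mask-aware visual extractor described above turns a binary mask region into a compact visual feature. A minimal NumPy sketch of the underlying idea (the function name, shapes, and plain average pooling are illustrative assumptions, not the actual Osprey implementation):

```python
import numpy as np

def mask_pooled_feature(feature_map: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average the feature vectors of all pixels covered by a binary mask.

    feature_map: (H, W, C) array of per-pixel visual features.
    mask:        (H, W) boolean array marking the region of interest.
    """
    weights = mask.astype(np.float32)
    denom = weights.sum()
    if denom == 0:
        raise ValueError("mask selects no pixels")
    # Weighted average over the spatial dimensions -> one C-dim region feature.
    return (feature_map * weights[..., None]).sum(axis=(0, 1)) / denom

# Toy example: a 4x4 feature map with 2 channels, masking the top-left 2x2 block.
feats = np.arange(32, dtype=np.float32).reshape(4, 4, 2)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
region_token = mask_pooled_feature(feats, mask)  # -> array([5., 6.])
```

The real extractor works on high-resolution convolutional CLIP features as described in the abstract; this sketch only shows the core mask-weighted pooling step.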

## Watch Video Demo 🎥

<p align="center"> <a href="https://youtu.be/YsxqHBBnDfk"><img src="https://github.com/CircleRadon/Osprey/raw/main/assets/video_cover.png" width="70%"></a> </p>

## Try Our Demo 🕹️
### Online demo
**Click** 👇 **to try our demo online.**

[**web demo**](http://111.0.123.204:8000/)

```
username: osprey
password: osprey
```

<table>
<tr>
<td style="text-align: center"><br>Point<br></td>
<td><img src="https://github.com/CircleRadon/Osprey/raw/main/assets/demo_point.gif" width="700"></td>
</tr>
<tr>
<td style="text-align: center"><br>Box<br></td>
<td><img src="https://github.com/CircleRadon/Osprey/raw/main/assets/demo_box.gif" width="700"></td>
</tr>
<tr>
<td style="text-align: center"><br>Everything<br></td>
<td><img src="https://github.com/CircleRadon/Osprey/raw/main/assets/demo_all.gif" width="700"></td>
</tr>
</table>

### Offline demo
💻 **Requirements:** this demo needs about `17GB` of GPU memory: Osprey (15GB) plus SAM (2GB).

1. First install [Gradio-Osprey-Demo](https://github.com/LiWentomng/gradio-osprey-demo).
2. Install Segment Anything.
```bash
pip install git+https://github.com/facebookresearch/segment-anything.git
```
3. Download all the checkpoints:

- [Osprey-7b](https://huggingface.co/sunshine-lwt/Osprey-7b/tree/main)
- [CLIP-convnext](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup/blob/main/open_clip_pytorch_model.bin)
- [ViT-B SAM model](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth)

The default path of all the checkpoints:
```
├── demo
    ├── checkpoints
    │   ├── Osprey_7b
    │   └── sam_vit_b_01ec64.pth
    └── open_clip_pytorch_model.bin
```

Or change the `mm_vision_tower` field in the `config.json` of the Osprey-7b model to the absolute path of `open_clip_pytorch_model.bin`.
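When a different layout is used, that `mm_vision_tower` edit can also be done programmatically. A minimal sketch (the helper name is ours; only the `config.json` field comes from the step above):

```python
import json
from pathlib import Path

def point_vision_tower(config_path: str, clip_ckpt: str) -> str:
    """Rewrite mm_vision_tower in a model config.json to an absolute path."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["mm_vision_tower"] = str(Path(clip_ckpt).resolve())
    path.write_text(json.dumps(config, indent=2))
    return config["mm_vision_tower"]

# Example (paths follow the default layout above):
# point_vision_tower("demo/checkpoints/Osprey_7b/config.json",
#                    "demo/open_clip_pytorch_model.bin")
```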

4. Run `app.py`.
```bash
cd demo
python app.py --model checkpoints/Osprey_7b
```

## Install 🛠️
1. Clone this repository and navigate to the Osprey folder
```bash
git clone https://github.com/CircleRadon/Osprey.git
cd Osprey
```
2. Install packages
```bash
conda create -n osprey python=3.10 -y
conda activate osprey
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
3. Install additional packages for training cases
```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```

## Dataset 🌟
All datasets for training can be found in [Dataset preparation](https://github.com/CircleRadon/Osprey/blob/main/dataset.md).

**Osprey-724K**: 🤗[Hugging Face](https://huggingface.co/datasets/AntGroup-MI/Osprey-724K)

`Osprey-724K` is an instruction dataset with mask-text pairs, containing around 724K GPT-generated multimodal dialogues that push MLLMs toward fine-grained, pixel-level image understanding. It contains object-level, part-level and additional instruction samples for robustness and flexibility.
<img src="https://github.com/CircleRadon/Osprey/raw/main/assets/data.png" />
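Mask annotations in mask-text datasets like this are commonly stored as COCO-style run-length encoding (RLE) rather than raw bitmaps. Purely as an illustration of that format (a generic uncompressed-RLE decoder; check the dataset card for Osprey-724K's exact schema), decoding is a few lines of NumPy:

```python
import numpy as np

def decode_rle(counts, height, width):
    """Decode an uncompressed COCO-style RLE into a boolean (H, W) mask.

    COCO RLE lists run lengths over the mask flattened in column-major
    order, starting with a run of background (0) pixels.
    """
    flat = np.zeros(height * width, dtype=bool)
    pos, value = 0, False
    for run in counts:
        flat[pos:pos + run] = value
        pos += run
        value = not value
    # Undo the column-major flattening.
    return flat.reshape((width, height)).T

# Toy 3x3 mask: 4 background pixels, then 2 foreground, then 3 background.
mask = decode_rle([4, 2, 3], height=3, width=3)
```

Here the two foreground pixels land at `mask[1, 1]` and `mask[2, 1]` because runs are counted down the columns.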

## Training 🚀
- **Stage 1: Image-Text Alignment Pre-training**
  - The pretrained projector weights for Convnext-large-CLIP can be found in [projector weights](https://huggingface.co/sunshine-lwt/osprey-v1.0-mlp2x-512px-convnext-pretrain-vicuna-7b-v1.5/tree/main).

- **Stage 2: Mask-Text Alignment Pre-training**
  - Download [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5/tree/main).
  - Download the projector weights trained in stage 1: [projector weights](https://huggingface.co/sunshine-lwt/osprey-v1.0-mlp2x-512px-convnext-pretrain-vicuna-7b-v1.5/tree/main).
  - Set `model_name_or_path` in `stage2.sh` to the path of `vicuna-7b-v1.5`.
  - Set `pretrain_mm_mlp_adapter` in `stage2.sh` to the path of `mm_projector`.
  - Set `vision_tower` in `stage2.sh` to the path of [Convnext-large-CLIP-model](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup/blob/main/open_clip_pytorch_model.bin).
  - Run `sh scripts/stage2.sh`.

- **Stage 3: End-to-End Fine-tuning**
  - Set `model_name_or_path` in `stage3.sh` to the path of the stage-2 checkpoint.
  - Set `vision_tower` in `stage3.sh` to the path of [Convnext-large-CLIP-model](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup/blob/main/open_clip_pytorch_model.bin).
  - Run `sh scripts/stage3.sh`.

## Checkpoints 🤖

Osprey-7b model 🤗: [model](https://huggingface.co/sunshine-lwt/Osprey-7b/tree/main)

We also provide the checkpoint of the intermediate stage 2; see [model](https://huggingface.co/sunshine-lwt/Osprey-7b-stage2/tree/main).

<div align=center>
<img src="https://github.com/CircleRadon/Osprey/raw/main/assets/performance.png" />
</div>

## Evaluation 🔎
See [evaluation](https://github.com/CircleRadon/Osprey/blob/main/osprey/eval/README.md) for details.

## TODO List 📝
- [x] Release the checkpoints, inference codes and demo.
- [x] Release the dataset and training scripts.
- [x] Release the evaluation code.
- [x] Release the code for the data generation pipeline.

## Acknowledgement 💌
- [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA): the codebase we built upon.
- [SAM](https://github.com/facebookresearch/segment-anything): the demo uses the segmentation results from SAM as the input of Osprey.

## BibTeX 🖊️
```bibtex
@misc{Osprey,
  title={Osprey: Pixel Understanding with Visual Instruction Tuning},
  author={Yuqian Yuan and Wentong Li and Jian Liu and Dongqi Tang and Xinjie Luo and Chi Qin and Lei Zhang and Jianke Zhu},
  year={2023},
  eprint={2312.10032},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```