Add comprehensive model card for Osprey
This PR adds a comprehensive model card for the Osprey model, significantly improving its documentation on the Hugging Face Hub.
Key improvements include:
- Linking the model to its official paper: [Osprey: Pixel Understanding with Visual Instruction Tuning](https://huggingface.co/papers/2312.10032).
- Including the paper's abstract for quick understanding.
- Adding `pipeline_tag: image-text-to-text` to enable discoverability on the Hub.
- Specifying `library_name: transformers` based on the `LlavaLlamaForCausalLM` architecture found in `config.json`, integrating it with the Hugging Face `transformers` library ecosystem.
- Including a link to the official GitHub repository for code access and further details.
- Incorporating a detailed introduction, core features, and the complete "Try Our Demo" section (online and offline demo setup) directly from the original GitHub repository to provide robust usage instructions.
- All relevant sections from the GitHub README have been adapted to the model card for a holistic view.
Please review and merge this PR to enhance the model's documentation on the Hugging Face Hub.
---
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- vision-language
- llava
- osprey
---

<p align="center" width="100%">
<img src="https://github.com/CircleRadon/Osprey/raw/main/assets/osprey.png" width="90%">
</p>

# Osprey: Pixel Understanding with Visual Instruction Tuning

This repository contains the Osprey model, presented in the paper [Osprey: Pixel Understanding with Visual Instruction Tuning](https://huggingface.co/papers/2312.10032).

<div align=center>

[Paper](https://arxiv.org/pdf/2312.10032.pdf) [Dataset](https://huggingface.co/datasets/AntGroup-MI/Osprey-724K) [Video](https://youtu.be/YsxqHBBnDfk) [Demo](http://111.0.123.204:8000/)
</div>

**Paper**: [https://huggingface.co/papers/2312.10032](https://huggingface.co/papers/2312.10032)

**GitHub Repository**: [https://github.com/CircleRadon/Osprey](https://github.com/CircleRadon/Osprey)

## Abstract

Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding, falling short in achieving fine-grained vision-language alignment at pixel level. Besides, the lack of mask-based instruction data limits their advancements. In this paper, we propose Osprey, a mask-text instruction tuning approach, to extend MLLMs by incorporating fine-grained mask regions into language instruction, aiming at achieving pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-language model by injecting pixel-level representation into LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high resolution input. Experimental results demonstrate Osprey's superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be integrated with Segment Anything Model (SAM) seamlessly to obtain multi-granularity semantics.

## What is Osprey 👀

Osprey is a mask-text instruction tuning approach that extends MLLMs by incorporating pixel-wise mask regions into language instructions, enabling **fine-grained visual understanding**. Given an input mask region, Osprey generates semantic descriptions, including a **short description** and a **detailed description**.

Osprey can seamlessly integrate with [SAM](https://github.com/facebookresearch/segment-anything) in point-prompt, box-prompt, and segment-everything modes to generate the semantics associated with specific parts or objects.
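
For context, here is a minimal sketch of the SAM side of that pairing, assuming the `segment-anything` package and a downloaded ViT-B checkpoint; the image path and click location are hypothetical. The resulting binary mask is what Osprey's mask-aware visual extractor consumes (the hand-off itself is handled by the repo's demo code, `demo/app.py`):

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load SAM (ViT-B variant) and wrap it in a predictor.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Point-prompt mode: one foreground click at a hypothetical (x, y).
image = np.array(Image.open("example.jpg").convert("RGB"))
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # hypothetical click location
    point_labels=np.array([1]),           # 1 = foreground point
    multimask_output=False,
)

# masks[0] is a boolean HxW array -- the mask region that Osprey
# pairs with a text instruction to produce its description.
```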

<img src="https://github.com/CircleRadon/Osprey/raw/main/assets/framework.png" width="800px">

## Watch Video Demo 🎥

<p align="center"> <a href="https://youtu.be/YsxqHBBnDfk"><img src="https://github.com/CircleRadon/Osprey/raw/main/assets/video_cover.png" width="70%"></a> </p>

## Try Our Demo 🕹️

### Online demo

**Click** 👇 **to try our demo online.**

[**web demo**](http://111.0.123.204:8000/)

```
username: osprey
password: osprey
```

<table>
<tr>
<td style="text-align: center"><br>Point<br></td>
<td><img src="https://github.com/CircleRadon/Osprey/raw/main/assets/demo_point.gif" width="700"></td>
</tr>
<tr>
<td style="text-align: center"><br>Box<br></td>
<td><img src="https://github.com/CircleRadon/Osprey/raw/main/assets/demo_box.gif" width="700"></td>
</tr>
<tr>
<td style="text-align: center"><br>Everything<br></td>
<td><img src="https://github.com/CircleRadon/Osprey/raw/main/assets/demo_all.gif" width="700"></td>
</tr>
</table>

### Offline demo

💻 **Requirements:** This demo needs about `17GB` of GPU memory: Osprey (`15GB`) and SAM (`2GB`).
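
As a quick sanity check before launching, a sketch that reports GPU headroom (assumes PyTorch with CUDA is installed):

```python
import torch

# Report free/total GPU memory so you can confirm the ~17GB headroom.
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory: {free / 2**30:.1f} GiB free / {total / 2**30:.1f} GiB total")
else:
    print("No CUDA device detected.")
```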

1. First install [Gradio-Osprey-Demo](https://github.com/LiWentomng/gradio-osprey-demo).
2. Install Segment Anything:
```bash
pip install git+https://github.com/facebookresearch/segment-anything.git
```

3. Download all the checkpoints:

- [Osprey-7b](https://huggingface.co/sunshine-lwt/Osprey-7b/tree/main)
- [CLIP-convnext](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup/blob/main/open_clip_pytorch_model.bin)
- [ViT-B SAM model](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth)
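
If you prefer scripting the downloads, here is a minimal sketch using `huggingface_hub`; the `local_dir` targets are an assumption that simply mirrors the default layout shown next:

```python
import urllib.request
from huggingface_hub import hf_hub_download, snapshot_download

# Osprey-7b: fetch the full model repo.
snapshot_download(repo_id="sunshine-lwt/Osprey-7b",
                  local_dir="demo/checkpoints/Osprey_7b")

# CLIP-convnext: only the open_clip weights file is needed.
hf_hub_download(
    repo_id="laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup",
    filename="open_clip_pytorch_model.bin",
    local_dir="demo",
)

# ViT-B SAM weights from the direct URL.
urllib.request.urlretrieve(
    "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth",
    "demo/checkpoints/sam_vit_b_01ec64.pth",
)
```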

The default layout of the checkpoints:

```
├── demo
    ├── checkpoints
    │   ├── Osprey_7b
    │   └── sam_vit_b_01ec64.pth
    └── open_clip_pytorch_model.bin
```

Alternatively, change `"mm_vision_tower"` in the `config.json` of the Osprey-7b model to the absolute path of `open_clip_pytorch_model.bin`.
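
A small sketch of that `config.json` edit; the file locations assume the default layout above, so adjust if yours differ:

```python
import json
import os

cfg_path = "demo/checkpoints/Osprey_7b/config.json"
clip_path = os.path.abspath("demo/open_clip_pytorch_model.bin")

with open(cfg_path) as f:
    cfg = json.load(f)

# Point the vision tower at the absolute path of the CLIP weights.
cfg["mm_vision_tower"] = clip_path

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```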

4. Run `app.py`:
```bash
cd demo
python app.py --model checkpoints/Osprey_7b
```

## Install 🛠️

1. Clone this repository and navigate to the Osprey folder:
```bash
git clone https://github.com/CircleRadon/Osprey.git
cd Osprey
```
2. Install packages:
```bash
conda create -n osprey python=3.10 -y
conda activate osprey
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
3. Install additional packages for training:
```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```

## Dataset 🌟

All the datasets for training can be found in [Dataset preparation](https://github.com/CircleRadon/Osprey/blob/main/dataset.md).

**Osprey-724K**: 🤗 [Hugging Face](https://huggingface.co/datasets/AntGroup-MI/Osprey-724K)

`Osprey-724K` is an instruction dataset with mask-text pairs, containing around 724K GPT-generated multimodal dialogues that encourage fine-grained, pixel-level image understanding in MLLMs. It contains object-level, part-level, and additional instruction samples for robustness and flexibility.

<img src="https://github.com/CircleRadon/Osprey/raw/main/assets/data.png" />
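
To pull the data locally, a minimal sketch using `huggingface_hub` that mirrors the whole dataset repo; see [Dataset preparation](https://github.com/CircleRadon/Osprey/blob/main/dataset.md) for how the files are used in training:

```python
from huggingface_hub import snapshot_download

# Download the Osprey-724K mask-text instruction data.
local_dir = snapshot_download(
    repo_id="AntGroup-MI/Osprey-724K",
    repo_type="dataset",
)
print(f"Dataset files downloaded to: {local_dir}")
```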

## Training 🚀

- **Stage 1: Image-Text Alignment Pre-training**
  - The pretrained projector weights for Convnext-large-CLIP can be found in [projector weights](https://huggingface.co/sunshine-lwt/osprey-v1.0-mlp2x-512px-convnext-pretrain-vicuna-7b-v1.5/tree/main).

- **Stage 2: Mask-Text Alignment Pre-training**
  - Download [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5/tree/main).
  - Download the projector weights trained in stage 1: [projector weights](https://huggingface.co/sunshine-lwt/osprey-v1.0-mlp2x-512px-convnext-pretrain-vicuna-7b-v1.5/tree/main).
  - Set `model_name_or_path` in `stage2.sh` to the path of `vicuna-7b-v1.5`.
  - Set `pretrain_mm_mlp_adapter` in `stage2.sh` to the path of `mm_projector`.
  - Set `vision_tower` in `stage2.sh` to the path of [Convnext-large-CLIP-model](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup/blob/main/open_clip_pytorch_model.bin).
  - Run `sh scripts/stage2.sh`.

- **Stage 3: End-to-End Fine-tuning**
  - Set `model_name_or_path` in `stage3.sh` to the path of the stage-2 checkpoint.
  - Set `vision_tower` in `stage3.sh` to the path of [Convnext-large-CLIP-model](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup/blob/main/open_clip_pytorch_model.bin).
  - Run `sh scripts/stage3.sh`.

## Checkpoints 🤖

Osprey-7b model 🤗: [model](https://huggingface.co/sunshine-lwt/Osprey-7b/tree/main)

We also provide the intermediate stage-2 checkpoint; see [model](https://huggingface.co/sunshine-lwt/Osprey-7b-stage2/tree/main).
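
Note that the checkpoint's `LlavaLlamaForCausalLM` architecture is defined in the Osprey codebase, so full inference should go through the repo's code (e.g., the demo above). A minimal sketch of what stock `transformers` can load directly, assuming the repo ships standard Vicuna/LLaMA tokenizer files:

```python
from transformers import AutoTokenizer

# The tokenizer follows the Vicuna/LLaMA format and loads without custom code.
tokenizer = AutoTokenizer.from_pretrained("sunshine-lwt/Osprey-7b", use_fast=False)
print(tokenizer("Describe the masked region.").input_ids)
```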

<div align=center>
<img src="https://github.com/CircleRadon/Osprey/raw/main/assets/performance.png" />
</div>

## Evaluation 🔎

See [evaluation](https://github.com/CircleRadon/Osprey/raw/main/osprey/eval/README.md) for details.

## TODO List 📌

- [x] Release the checkpoints, inference code, and demo.
- [x] Release the dataset and training scripts.
- [x] Release the evaluation code.
- [x] Release the code for the data generation pipeline.

## Acknowledgement 💌

- [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA): the codebase we built upon.
- [SAM](https://github.com/facebookresearch/segment-anything): the demo uses segmentation results from SAM as the input to Osprey.

## BibTeX 🖊️

```bibtex
@misc{Osprey,
      title={Osprey: Pixel Understanding with Visual Instruction Tuning},
      author={Yuqian Yuan and Wentong Li and Jian Liu and Dongqi Tang and Xinjie Luo and Chi Qin and Lei Zhang and Jianke Zhu},
      year={2023},
      eprint={2312.10032},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```