derektan committed
Commit 3015687 · 1 Parent(s): 1faad26

Init proper README

Files changed (1):
  1. README.md +9 -339
README.md CHANGED
@@ -1,339 +1,9 @@
- [![Gradio](https://img.shields.io/badge/Gradio-Online%20Demo-blue)](http://103.170.5.190:7860/)
- [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/openxlab-app/LISA)
-
- # LISA: Reasoning Segmentation via Large Language Model
-
- Note: This is a fork of the [original LISA repository](https://github.com/dvlab-research/LISA), finetuned on the [AVS-Bench](https://huggingface.co/datasets/derektan95/avs-bench) visual search remote sensing dataset as a baseline for [Search-TTA](https://search-tta.github.io/). To run the finetuned LISA model on the AVS-Bench dataset, run one of the following:
- ```
- CUDA_VISIBLE_DEVICES=0 python chat.py --version='derektan95/LISA-RS' --precision='bf16'
- CUDA_VISIBLE_DEVICES=0 python app.py --version='derektan95/LISA-RS' --precision='bf16'
- ```
-
- <font size=7><div align='center'><b>LISA</b>: Large <b>L</b>anguage <b>I</b>nstructed <b>S</b>egmentation <b>A</b>ssistant</div></font>
-
- <font size=7><div align='center'>
- <a href="https://arxiv.org/pdf/2308.00692.pdf"><strong>Paper</strong></a> |
- <a href="https://huggingface.co/xinlai"><strong>Models</strong></a> |
- <a href="#training"><strong>Training</strong></a> |
- <a href="#inference"><strong>Inference</strong></a> |
- <a href="#deployment"><strong>Local Deployment</strong></a> |
- <a href="#dataset"><strong>Dataset</strong></a> |
- <a href=""><strong>Online Demo</strong></a> |
- <a href="https://huggingface.co/collections/Senqiao/lisa-67713837a32d6abf516a162e"><strong>LISA++ Dataset</strong></a> |
- <a href="https://huggingface.co/Senqiao/LISA_Plus_7b"><strong>LISA++ Model</strong></a>
- </div></font>
-
- <!-- <p align="center"> <img src="imgs/teaser.jpg" width="100%"> </p> -->
-
- <table class="center">
- <tr>
- <td style="text-align:center;"><b>Input</b></td>
- <td style="text-align:center;"><b>Output</b></td>
- <td style="text-align:center;"><b>Input</b></td>
- <td style="text-align:center;"><b>Output</b></td>
- </tr>
- <tr>
- <td><img src="imgs/obama.jpg"></td>
- <td><img src="vis_output/obama.jpg"></td>
- <td><img src="imgs/trump.jpg"></td>
- <td><img src="vis_output/trump.jpg"></td>
- </tr>
- <tr>
- <td width=25% style="text-align:center;color:gray;">"Who was <ins>the president of the US</ins> in this image? Please output segmentation mask and explain the reason."</td>
- <td width=25% style="text-align:center;">"Sure, the segmentation result is [SEG]. The President of the United States in the image is <ins>President Obama.</ins>"</td>
- <td width=25% style="text-align:center;color:gray">"Who was <ins>the president of the US</ins> in this image? Please output segmentation mask and explain why."</td>
- <td width=25% style="text-align:center;">"Sure, [SEG]. In the image, the President of the United States is <ins>President Trump.</ins>"</td>
- </tr>
-
- <tr>
- <td><img src="imgs/stand_higher.jpg"></td>
- <td><img src="vis_output/stand_higher.jpg"></td>
- <td><img src="imgs/camera_lens.jpg"></td>
- <td><img src="vis_output/camera_lens.jpg"></td>
- </tr>
- <tr>
- <td width=25% style="text-align:center;color:gray;">"What can <ins>make the woman stand higher</ins>? Please output segmentation mask and explain why."</td>
- <td width=25% style="text-align:center;">"Sure, [SEG]. The woman is <ins>standing higher by using a ladder</ins>..."</td>
- <td width=25% style="text-align:center;color:gray">"Can you segment <ins>the camera lens that is more suitable for photographing nearby objects</ins> in this image?"</td>
- <td width=25% style="text-align:center;">"Sure, [SEG]."</td>
- </tr>
-
- <tr>
- <td><img src="imgs/dog_with_horn.jpg"></td>
- <td><img src="vis_output/dog_with_horn.jpg"></td>
- <td><img src="imgs/wash_hands.jpg"></td>
- <td><img src="vis_output/wash_hands.jpg"></td>
- </tr>
- <tr>
- <td width=25% style="text-align:center;color:gray;">"Can you segment <ins>the unusual part</ins> in this image and explain why?"</td>
- <td width=25% style="text-align:center;">"Sure, it is [SEG]. In the image, the unusual part is <ins>the dog wearing a reindeer antler headband</ins>..."</td>
- <td width=25% style="text-align:center;color:gray">"Where to <ins>wash hands</ins> in this image? Please output segmentation mask."</td>
- <td width=25% style="text-align:center;">"Sure, [SEG]."</td>
- </tr>
-
- <tr>
- <td><img src="imgs/jackma.jpg"></td>
- <td><img src="vis_output/jackma.jpg"></td>
- <td><img src="imgs/blackpink.jpg"></td>
- <td><img src="vis_output/blackpink.jpg"></td>
- </tr>
- <tr>
- <td width=25% style="text-align:center;color:gray;">"Can you segment <ins>the founder of Alibaba</ins> in this image and explain why?"</td>
- <td width=25% style="text-align:center;">"Sure, it is [SEG]. In the image, the man wearing a red tie and sitting on one of the chairs is <ins>Jack Ma</ins>, the co-founder of Alibaba Group..."</td>
- <td width=25% style="text-align:center;color:gray">"Please segment <ins>Lisa</ins> in this figure."</td>
- <td width=25% style="text-align:center;">"Sure, [SEG]."</td>
- </tr>
- </table>
-
- <p align="center"> <img src="imgs/fig_overview.jpg" width="100%"> </p>
-
- ## News
- - [x] [2024.12.30] We released the [LISA++](https://arxiv.org/abs/2312.17240) model and datasets, available [here](https://huggingface.co/collections/Senqiao/lisa-67713837a32d6abf516a162e). Our findings show that incorporating visual CoT data can further enhance the model's global understanding. We will update the paper soon; stay tuned!
- - [x] [2024.6.21] LISA was selected for an oral presentation at CVPR 2024!
- - [x] [2023.8.30] Released three new models: [LISA-7B-v1](https://huggingface.co/xinlai/LISA-7B-v1), [LISA-7B-v1-explanatory](https://huggingface.co/xinlai/LISA-7B-v1-explanatory), and [LISA-13B-llama2-v1-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v1-explanatory). Welcome to check them out!
- - [x] [2023.8.23] Refactored the code and released the new model [LISA-13B-llama2-v1](https://huggingface.co/xinlai/LISA-13B-llama2-v1). Welcome to check it out!
- - [x] [2023.8.9] Training code is released!
- - [x] [2023.8.4] [Online Demo](http://103.170.5.190:7860/) is released!
- - [x] [2023.8.4] [*ReasonSeg* Dataset](https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing) and the [LISA-13B-llama2-v0-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v0-explanatory) model are released!
- - [x] [2023.8.3] Inference code and the [LISA-13B-llama2-v0](https://huggingface.co/xinlai/LISA-13B-llama2-v0) model are released. Welcome to check them out!
- - [x] [2023.8.2] [Paper](https://arxiv.org/pdf/2308.00692.pdf) is released and the GitHub repo is created.
-
- **LISA: Reasoning Segmentation via Large Language Model [[Paper](https://arxiv.org/abs/2308.00692)]** <br />
- [Xin Lai](https://scholar.google.com/citations?user=tqNDPA4AAAAJ&hl=zh-CN),
- [Zhuotao Tian](https://scholar.google.com/citations?user=mEjhz-IAAAAJ&hl=en),
- [Yukang Chen](https://scholar.google.com/citations?user=6p0ygKUAAAAJ&hl=en),
- [Yanwei Li](https://scholar.google.com/citations?user=I-UCPPcAAAAJ&hl=zh-CN),
- [Yuhui Yuan](https://scholar.google.com/citations?user=PzyvzksAAAAJ&hl=en),
- [Shu Liu](https://scholar.google.com.hk/citations?user=BUEDUFkAAAAJ&hl=zh-CN),
- [Jiaya Jia](https://scholar.google.com/citations?user=XPAkzTEAAAAJ&hl=en)<br />
-
- **LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model [[Paper](https://arxiv.org/abs/2312.17240)]** <br />
- [Senqiao Yang](https://scholar.google.com/citations?user=NcJc-RwAAAAJ),
- Tianyuan Qu,
- [Xin Lai](https://scholar.google.com/citations?user=tqNDPA4AAAAJ&hl=zh-CN),
- [Zhuotao Tian](https://scholar.google.com/citations?user=mEjhz-IAAAAJ&hl=en),
- [Bohao Peng](https://scholar.google.com.hk/citations?user=9xcCm1oAAAAJ),
- [Shu Liu](https://scholar.google.com.hk/citations?user=BUEDUFkAAAAJ&hl=zh-CN),
- [Jiaya Jia](https://scholar.google.com/citations?user=XPAkzTEAAAAJ&hl=en)<br />
-
- ## Abstract
- In this work, we propose a new segmentation task --- ***reasoning segmentation***. The task is designed to output a segmentation mask given a complex and implicit query text. We establish a benchmark comprising over one thousand image-instruction pairs, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multi-modal Large Language Models (LLMs) while also possessing the ability to produce segmentation masks.
- For more details, please refer to the [paper](https://arxiv.org/abs/2308.00692).
-
- ## Highlights
- **LISA** unlocks new segmentation capabilities for multi-modal LLMs, and can handle cases involving:
- 1. complex reasoning;
- 2. world knowledge;
- 3. explanatory answers;
- 4. multi-turn conversation.
-
- **LISA** also demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation image-instruction pairs results in further performance enhancement.
-
- ## Experimental results
- <p align="center"> <img src="imgs/table1.jpg" width="80%"> </p>
-
- ## Installation
- ```
- pip install -r requirements.txt
- pip install flash-attn --no-build-isolation
- ```
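- As a quick, optional sanity check (assuming a CUDA-capable environment; this one-liner is a suggestion, not an official step), verify that the key dependencies import cleanly:
- ```
- python -c "import torch, flash_attn; print(torch.cuda.is_available())"
- ```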
-
- ## Training
- ### Training Data Preparation
- The training data consists of 4 types of data:
-
- 1. Semantic segmentation datasets: [ADE20K](http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip), [COCO-Stuff](http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/stuffthingmaps_trainval2017.zip), [Mapillary](https://www.mapillary.com/dataset/vistas), [PACO-LVIS](https://github.com/facebookresearch/paco/tree/main#dataset-setup), [PASCAL-Part](https://github.com/facebookresearch/VLPart/tree/main/datasets#pascal-part), [COCO Images](http://images.cocodataset.org/zips/train2017.zip)
-
- Note: For COCO-Stuff, we use the annotation file stuffthingmaps_trainval2017.zip. We only use the PACO-LVIS part of PACO. COCO images should be put into the `dataset/coco/` directory.
-
- 2. Referring segmentation datasets: [refCOCO](https://web.archive.org/web/20220413011718/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip), [refCOCO+](https://web.archive.org/web/20220413011656/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip), [refCOCOg](https://web.archive.org/web/20220413012904/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip), [refCLEF](https://web.archive.org/web/20220413011817/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refclef.zip) ([saiapr_tc-12](https://web.archive.org/web/20220515000000/http://bvisionweb1.cs.unc.edu/licheng/referit/data/images/saiapr_tc-12.zip))
-
- Note: The original links for the refCOCO-series data are down, so we have updated them with new ones. If the download speed is very slow or unstable, we also provide an [OneDrive link](https://mycuhk-my.sharepoint.com/:f:/g/personal/1155154502_link_cuhk_edu_hk/Em5yELVBvfREodKC94nOFLoBLro_LPxsOxNV44PHRWgLcA?e=zQPjsc) for downloading. **You must also follow the rules that the original datasets require.**
-
- 3. Visual Question Answering dataset: [LLaVA-Instruct-150k](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_150k.json)
-
- 4. Reasoning segmentation dataset: [ReasonSeg](https://github.com/dvlab-research/LISA#dataset)
-
- Download them from the links above and organize them as follows.
-
- ```
- ├── dataset
- │   ├── ade20k
- │   │   ├── annotations
- │   │   └── images
- │   ├── coco
- │   │   └── train2017
- │   │       ├── 000000000009.jpg
- │   │       └── ...
- │   ├── cocostuff
- │   │   └── train2017
- │   │       ├── 000000000009.png
- │   │       └── ...
- │   ├── llava_dataset
- │   │   └── llava_instruct_150k.json
- │   ├── mapillary
- │   │   ├── config_v2.0.json
- │   │   ├── testing
- │   │   ├── training
- │   │   └── validation
- │   ├── reason_seg
- │   │   └── ReasonSeg
- │   │       ├── train
- │   │       ├── val
- │   │       └── explanatory
- │   ├── refer_seg
- │   │   ├── images
- │   │   │   ├── saiapr_tc-12
- │   │   │   └── mscoco
- │   │   │       └── images
- │   │   │           └── train2014
- │   │   ├── refclef
- │   │   ├── refcoco
- │   │   ├── refcoco+
- │   │   └── refcocog
- │   └── vlpart
- │       ├── paco
- │       │   └── annotations
- │       └── pascal_part
- │           ├── train.json
- │           └── VOCdevkit
- ```
-
- ### Pre-trained weights
-
- #### LLaVA
- To train LISA-7B or 13B, you need to follow the [instructions](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md) to merge the LLaVA delta weights. Typically, we use the final weights `LLaVA-Lightning-7B-v1-1` and `LLaVA-13B-v1-1`, merged from `liuhaotian/LLaVA-Lightning-7B-delta-v1-1` and `liuhaotian/LLaVA-13b-delta-v1-1`, respectively. For Llama2, we can directly use the LLaVA full weights `liuhaotian/llava-llama-2-13b-chat-lightning-preview`. A sketch of the merge command is shown below.
-
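- For reference, the delta merge with LLaVA's `apply_delta` script looks roughly like the following (the base-model path is a placeholder; follow the linked instructions for the exact invocation for your LLaVA version):
- ```
- # Sketch only: merge the 7B delta weights onto the base LLaMA weights
- python3 -m llava.model.apply_delta \
-     --base PATH_TO_LLAMA_7B \
-     --target ./LLaVA/LLaVA-Lightning-7B-v1-1 \
-     --delta liuhaotian/LLaVA-Lightning-7B-delta-v1-1
- ```
-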
- #### SAM ViT-H weights
- Download the SAM ViT-H pre-trained weights from this [link](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth).
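- For example, with `wget`:
- ```
- wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
- ```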
-
- ### Training
- ```
- deepspeed --master_port=24999 train_ds.py \
-   --version="PATH_TO_LLaVA" \
-   --dataset_dir='./dataset' \
-   --vision_pretrained="PATH_TO_SAM" \
-   --dataset="sem_seg||refer_seg||vqa||reason_seg" \
-   --sample_rates="9,3,3,1" \
-   --exp_name="lisa-7b"
- ```
- When training is finished, get the full model weights with:
- ```
- cd ./runs/lisa-7b/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin
- ```
-
- ### Merge LoRA Weight
- Merge the LoRA weights in `pytorch_model.bin` and save the resulting model to your desired path in the Hugging Face format:
- ```
- CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
-   --version="PATH_TO_LLaVA" \
-   --weight="PATH_TO_pytorch_model.bin" \
-   --save_path="PATH_TO_SAVED_MODEL"
- ```
-
- For example:
- ```
- CUDA_VISIBLE_DEVICES="" python3 merge_lora_weights_and_save_hf_model.py \
-   --version="./LLaVA/LLaVA-Lightning-7B-v1-1" \
-   --weight="lisa-7b/pytorch_model.bin" \
-   --save_path="./LISA-7B"
- ```
-
- ### Validation
- ```
- deepspeed --master_port=24999 train_ds.py \
-   --version="PATH_TO_LISA_HF_Model_Directory" \
-   --dataset_dir='./dataset' \
-   --vision_pretrained="PATH_TO_SAM" \
-   --exp_name="lisa-7b" \
-   --eval_only
- ```
-
- Note: the `v1` models are trained on both the `train` and `val` sets, so please use the `v0` models to reproduce the validation results. (To use the `v0` models, first check out the legacy version of the repo with `git checkout 0e26916`.)
-
-
- ## Inference
-
- To chat with [LISA-13B-llama2-v1](https://huggingface.co/xinlai/LISA-13B-llama2-v1) or [LISA-13B-llama2-v1-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v1-explanatory):
- (Note that `chat.py` currently does not support the `v0` models, i.e., `LISA-13B-llama2-v0` and `LISA-13B-llama2-v0-explanatory`; to use the `v0` models, first check out the legacy version of the repo with `git checkout 0e26916`.)
- ```
- CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1'
- CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1-explanatory'
- ```
- To use the `bf16` or `fp16` data type for inference:
- ```
- CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='bf16'
- ```
- To use `8bit` or `4bit` quantization for inference (this enables running the 13B model on a single 24G or 12G GPU, at some cost in generation quality):
- ```
- CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='fp16' --load_in_8bit
- CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='fp16' --load_in_4bit
- ```
- Hint: for the 13B model, 16-bit inference consumes 30G of VRAM on a single GPU, 8-bit inference consumes 16G, and 4-bit inference consumes 9G.
-
- After that, input the text prompt and then the image path. For example:
- ```
- - Please input your prompt: Where can the driver see the car speed in this image? Please output segmentation mask.
- - Please input the image path: imgs/example1.jpg
-
- - Please input your prompt: Can you segment the food that tastes spicy and hot?
- - Please input the image path: imgs/example2.jpg
- ```
- The results should look like this:
- <p align="center"> <img src="imgs/example1.jpg" width="22%"> <img src="vis_output/example1_masked_img_0.jpg" width="22%"> <img src="imgs/example2.jpg" width="25%"> <img src="vis_output/example2_masked_img_0.jpg" width="25%"> </p>
-
- ## Deployment
- ```
- CUDA_VISIBLE_DEVICES=0 python app.py --version='xinlai/LISA-13B-llama2-v1' --load_in_4bit
- CUDA_VISIBLE_DEVICES=0 python app.py --version='xinlai/LISA-13B-llama2-v1-explanatory' --load_in_4bit
- ```
- By default, we use 4-bit quantization. Feel free to remove the `--load_in_4bit` argument for 16-bit inference, or replace it with `--load_in_8bit` for 8-bit inference.
-
-
- ## Dataset
- In ReasonSeg, we have collected 1218 images (239 train, 200 val, and 779 test). The training and validation sets can be downloaded from <a href="https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing">**this link**</a>.
-
- Each image is provided with an annotation JSON file:
- ```
- image_1.jpg, image_1.json
- image_2.jpg, image_2.json
- ...
- image_n.jpg, image_n.json
- ```
- Important keys contained in the JSON files:
- ```
- - "text": text instructions.
- - "is_sentence": whether the text instructions are long sentences.
- - "shapes": target polygons.
- ```
-
- The elements of "shapes" fall into two categories, namely **"target"** and **"ignore"**. The former is indispensable for evaluation, while the latter denotes ambiguous regions and is hence disregarded during evaluation.
-
- We provide a <a href="https://github.com/dvlab-research/LISA/blob/main/utils/data_processing.py">**script**</a> that demonstrates how to process the annotations:
- ```
- python3 utils/data_processing.py
- ```
-
- Besides, we leveraged GPT-3.5 to rephrase the instructions, so images in the training set may have **more than one instruction (but fewer than six)** in the "text" field. During training, users may randomly select one as the text query to obtain a better model, as sketched below.
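-
- As a rough illustration (not the repository's own loader; the exact shape keys are an assumption, and `utils/data_processing.py` remains the authoritative reference), loading one annotation and sampling a query could look like:
- ```
- import json
- import random
-
- # Minimal sketch: parse one ReasonSeg annotation as described above.
- with open("dataset/reason_seg/ReasonSeg/train/image_1.json") as f:
-     ann = json.load(f)
-
- texts = ann["text"]                # one or more rephrased instructions
- is_sentence = ann["is_sentence"]   # whether the instructions are long sentences
- query = random.choice(texts)       # randomly pick one instruction per training step
-
- # "shapes" polygons are labeled either "target" (used for evaluation)
- # or "ignore" (ambiguous regions, excluded from evaluation).
- targets = [s for s in ann["shapes"] if s.get("label") == "target"]
- ignores = [s for s in ann["shapes"] if s.get("label") == "ignore"]
- print(query, len(targets), len(ignores))
- ```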
-
-
- ## Citation
- If you find this project useful in your research, please consider citing:
-
- ```
- @article{lai2023lisa,
-   title={LISA: Reasoning Segmentation via Large Language Model},
-   author={Lai, Xin and Tian, Zhuotao and Chen, Yukang and Li, Yanwei and Yuan, Yuhui and Liu, Shu and Jia, Jiaya},
-   journal={arXiv preprint arXiv:2308.00692},
-   year={2023}
- }
- @article{yang2023improved,
-   title={An Improved Baseline for Reasoning Segmentation with Large Language Model},
-   author={Yang, Senqiao and Qu, Tianyuan and Lai, Xin and Tian, Zhuotao and Peng, Bohao and Liu, Shu and Jia, Jiaya},
-   journal={arXiv preprint arXiv:2312.17240},
-   year={2023}
- }
- ```
-
- ## Acknowledgement
- - This work is built upon [LLaVA](https://github.com/haotian-liu/LLaVA) and [SAM](https://github.com/facebookresearch/segment-anything).
 
+ title: LISA-AVS
+ emoji: 🦁
+ colorFrom: green
+ colorTo: gray
+ sdk: gradio
+ sdk_version: 5.31.0
+ app_file: app.py
+ pinned: false
+ short_description: LISA VLM finetuned using AVS-Bench dataset