Improve model card: add pipeline tag, library, paper & code links, Gradio demo usage

#2 by nielsr HF Staff - opened
Files changed (1)
  1. README.md +111 -3
README.md CHANGED
@@ -1,3 +1,111 @@
- ---
- license: apache-2.0
- ---
---
license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
---

# MedPLIB: Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine

This repository contains the official implementation of the paper: [**Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine**](https://huggingface.co/papers/2412.09278).

<p align="center">
  <img src="https://github.com/ShawnHuang497/MedPLIB/raw/main/assets/logo.png" width="150" style="margin-bottom: 0.2;"/>
</p>

<p align="center">
  <a href="https://github.com/ShawnHuang497/MedPLIB"><img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&style=flat-square" alt="GitHub Code"></a>
</p>

## Abstract
MedPLIB is an end-to-end multimodal large language model (MLLM) for the biomedical domain with pixel-level understanding. It supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. Training follows a Mixture-of-Experts (MoE) multi-stage strategy that first trains a visual-language expert and a pixel-grounding expert separately, then fine-tunes them jointly with MoE, keeping inference cost comparable to a single expert. To advance research on pixel-level understanding in biomedicine, we also introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), spanning 8 imaging modalities for complex medical imaging question answering and image region understanding. MedPLIB achieves state-of-the-art results across multiple medical visual-language tasks and leads existing models by a clear margin in zero-shot pixel grounding on the mDice metric.

## Highlights
MedPLIB shows excellent performance in pixel-level understanding in the biomedical field.

- ✨ MedPLIB is a biomedical MLLM with a huge breadth of abilities and support for multiple imaging modalities. Not only can it perform image-level visual-language tasks such as VQA, but it also facilitates question answering at the pixel level.

<p align="center">
  <img src="https://github.com/ShawnHuang497/MedPLIB/raw/main/assets/capa.png" style="margin-bottom: 0.2;"/>
</p>

- ✨ We construct the MeCoVQA dataset, which comprises 8 modalities with a total of 310k pairs for complex medical imaging question answering and image region understanding.
<p align="center">
  <img src="https://github.com/ShawnHuang497/MedPLIB/raw/main/assets/data.png" style="margin-bottom: 0.2;"/>
</p>

## Installation
For detailed instructions, please refer to the [GitHub repository](https://github.com/ShawnHuang497/MedPLIB).

1. **Clone this repository and navigate to the MedPLIB folder**
```bash
git clone https://github.com/ShawnHuang497/MedPLIB.git
cd MedPLIB
```
2. **Install packages**
```Shell
conda create -n medplib python=3.10 -y
conda activate medplib
pip install --upgrade pip
pip install -r requirements.txt
```
3. **Install additional packages for training**
```Shell
pip install ninja==1.11.1.1
pip install flash-attn==2.5.2 --no-build-isolation
```
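
After installation, a minimal sanity check can confirm that the training-specific dependencies import cleanly. This is a sketch, not part of the official setup; it assumes the `medplib` conda environment created above is active.

```bash
# Sanity check (sketch): confirm the key training dependencies import cleanly.
# Assumes the `medplib` conda environment created above is active.
python - <<'EOF'
import importlib

for pkg in ("torch", "flash_attn", "ninja"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError as err:
        print(f"{pkg}: MISSING ({err})")
EOF
```

If any package reports `MISSING`, re-run the corresponding `pip install` step above before training.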

## Sample Usage: Gradio Web UI
We recommend trying our web demo, which includes all the features currently supported by MedPLIB. To run the demo, you need to download or train MedPLIB so that the checkpoints are available locally. Please run the following commands one by one.

```bash
# launch the server controller
python -m model.serve.controller --host 0.0.0.0 --port 64000
```

```bash
# launch the web server
python -m model.serve.gradio_web_server --controller http://localhost:64000 --model-list-mode reload --add_region_feature --port 64001
```

```bash
# launch the model worker
CUDA_VISIBLE_DEVICES=0 python -m model.serve.model_worker --host localhost --controller http://localhost:64000 --port 64002 --worker http://localhost:64002 --model-path /path/to/the/medplib_checkpoints --add_region_feature --device_map cuda --vision_pretrained /path/to/the/sam-med2d_b.pth
```
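
Once all three processes are up, the demo should be reachable in a browser at `http://localhost:64001` (assuming the default ports above). A quick way to confirm that the controller, web server, and model worker are all listening is the following sketch, which uses bash's `/dev/tcp` redirection; adjust the ports if you changed them.

```bash
# Health check (sketch): confirm each service port is accepting connections.
# Assumes the default ports used in the commands above.
for port in 64000 64001 64002; do
  if (exec 3<>"/dev/tcp/localhost/${port}") 2>/dev/null; then
    echo "port ${port}: listening"
  else
    echo "port ${port}: not reachable"
  fi
done
```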

- **Pixel grounding:**
<p align="center">
  <img src="https://github.com/ShawnHuang497/MedPLIB/raw/main/assets/seg.gif" style="width: 70%;"/>
</p>

- **Region VQA:**
<p align="center">
  <img src="https://github.com/ShawnHuang497/MedPLIB/raw/main/assets/rqa.gif" style="width: 70%;"/>
</p>

- **VQA:**
<p align="center">
  <img src="https://github.com/ShawnHuang497/MedPLIB/raw/main/assets/vqa.gif" style="width: 70%;"/>
</p>

## Acknowledgement
We thank the following works for inspiration and parts of the code: [LISA](https://github.com/dvlab-research/LISA), [MoE-LLaVA](https://github.com/PKU-YuanGroup/MoE-LLaVA), [LLaVA](https://github.com/haotian-liu/LLaVA), [SAM-Med2D](https://github.com/OpenGVLab/SAM-Med2D), [SAM](https://github.com/facebookresearch/segment-anything) and [SEEM](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once).

## Model Use
### Intended Use
The data, code, and model checkpoints are intended to be used solely for (I) future research on visual-language processing and (II) reproducibility of the experimental results reported in the reference paper. They are not intended to be used in clinical care or for any clinical decision-making purposes.
### Primary Intended Use
The primary intended use is to support AI researchers reproducing and building on top of this work. MedPLIB and its associated models should be helpful for exploring various biomedical pixel grounding and visual question answering (VQA) research questions.
### Out-of-Scope Use
Any deployed use case of the model, commercial or otherwise, is out of scope. Although we evaluated the models using a broad set of publicly available research benchmarks, the models and evaluations are intended for research use only and are not intended for deployed use cases.

## Citation
If you find our paper and code useful in your research, please consider giving us a star and citing our work.

```bibtex
@article{huang2024towards,
  title={Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine},
  author={Huang, Xiaoshuang and Shen, Lingdong and Liu, Jia and Shang, Fangxin and Li, Hongxiang and Huang, Haifeng and Yang, Yehui},
  journal={arXiv preprint arXiv:2412.09278},
  year={2024}
}
```