[martin-dev] fix readme
README.md (CHANGED)
@@ -1,245 +1,17 @@

**Removed (previous README content):**

- [Run Qwen2-VL-2B Embeddings Extraction](#run-qwen2-vl-2b-embeddings-extraction)
- [Layers of Interest in a VLM](#layers-of-interest-in-a-vlm)
- [Retrieving All Named Modules](#retrieving-all-named-modules)
- [Matching Layers](#matching-layers)
- [Feature Extraction using HuggingFace Datasets](#feature-extraction-using-huggingface-datasets)
- [Output Database](#output-database)
- [Demo: Principal Component Analysis over Primitive Concept](#principal-component-analysis-over-primitive-concept)
- [Contributing to VLM-Lens](#contributing-to-vlm-lens)
- [Miscellaneous](#miscellaneous)

## Environment Setup

We recommend using a virtual environment to manage your dependencies. You can create and activate one with:
```bash
virtualenv --no-download "venv/vlm-lens-base" --prompt "vlm-lens-base"  # Or: python3.10 -m venv venv/vlm-lens-base
source venv/vlm-lens-base/bin/activate
```

Then, install the required dependencies:
```bash
pip install --upgrade pip
pip install -r envs/base/requirements.txt
```

Some models require different dependencies, and we recommend creating a separate virtual environment for each of them to avoid conflicts.
For such models, we offer a separate `requirements.txt` file under `envs/<model_name>/requirements.txt`, which can be installed in the same way as above.
All the model-specific environments are independent of the base environment and can be installed individually.

**Notes**:

1. There may be local constraints (e.g., issues caused by cluster regulations) that cause the above commands to fail. In such cases, you are encouraged to modify them as needed. We welcome issues and pull requests to help us keep the dependencies up to date.
2. Some models, due to the resources available at development time, may not be fully supported on modern GPUs. While our released environments are tested on L40S GPUs, we recommend following the error messages to adjust the environment setup for your specific hardware.

## Example Usage: Extract Qwen2-VL-2B Embeddings with VLM-Lens

### General Command-Line Demo

The general command to run the quick command-line demo is:
```bash
python -m src.main \
    --config <config-file-path> \
    --debug
```
with an optional `--debug` flag to see more detailed outputs.

Note that the config file should be in YAML format, and that any arguments you want to pass to the HuggingFace API should be nested under the `model` key.
See `configs/models/qwen/qwen-2b.yaml` as an example.

### Run Qwen2-VL-2B Embeddings Extraction

The file `configs/models/qwen/qwen-2b.yaml` contains the configuration for running the Qwen2-VL-2B model.

```yaml
architecture: qwen  # Architecture of the model, see more options in src/models/configs.py
model_path: Qwen/Qwen2-VL-2B-Instruct  # HuggingFace model path
model:  # Model configuration, i.e., arguments to pass to the model
  - torch_dtype: auto
output_db: output/qwen.db  # Output database file to store embeddings
input_dir: ./data/  # Directory containing images to process
prompt: "Describe the color in this image in one word."  # Textual prompt
pooling_method: None  # Pooling method for aggregating token embeddings (options: None, mean, max)
modules:  # List of modules to extract embeddings from
  - lm_head
  - visual.blocks.31
```

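To make the `pooling_method` options concrete, the snippet below is a minimal sketch (not VLM-Lens internals) of what mean and max pooling over a layer's token embeddings would compute; the tensor shape is an arbitrary placeholder for illustration.

```python
# Illustrative only: shapes and values are placeholders, not VLM-Lens internals.
import torch

token_embeddings = torch.randn(12, 1536)         # (num_tokens, hidden_dim) for one image/prompt pair
mean_pooled = token_embeddings.mean(dim=0)       # pooling_method: mean -> shape (1536,)
max_pooled = token_embeddings.max(dim=0).values  # pooling_method: max  -> shape (1536,)
# pooling_method: None keeps the full (num_tokens, hidden_dim) tensor.
```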

To run the extraction on an available GPU, use the following command:
```bash
python -m src.main --config configs/models/qwen/qwen-2b.yaml --debug
```

If there is no GPU available, you can run it on CPU with:
```bash
python -m src.main --config configs/models/qwen/qwen-2b.yaml --device cpu --debug
```

## Layers of Interest in a VLM

### Retrieving All Named Modules

Unfortunately, there is no way to find out which layers are available to match against without loading the model, and loading can take quite a bit of time.

Instead, we offer some cached results under `logs/` for each model, which were generated by including the `-l` or `--log-named-modules` flag when running `python -m src.main`.

When using this flag, it is not necessary to set modules or anything besides the architecture and HuggingFace model path.

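If you do want to inspect the names yourself, the following is a minimal sketch of doing so directly with the `transformers` library (assuming a version that ships Qwen2-VL); note that it loads the full checkpoint, which is exactly the slow step mentioned above.

```python
# A sketch outside of VLM-Lens: list candidate module names for the Qwen2-VL-2B checkpoint.
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto"
)
for name, _ in model.named_modules():
    print(name)  # e.g. "lm_head", "visual.blocks.31", "model.layers.0.self_attn.q_proj", ...
```
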
### Matching Layers

To automatically select which layers to use, specify Unix-style glob patterns, where `*` denotes a wildcard.

For example, to match every attention layer's query projection in Qwen, simply add the following lines to the `.yaml` file:
```yaml
modules:
  - model.layers.*.self_attn.q_proj
```

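As a rough mental model of how such a pattern expands, Python's `fnmatch` applies the same Unix-style semantics; this is only an illustration, not the matching code VLM-Lens uses internally.

```python
# Illustrative glob matching over module names (names taken from the examples above).
import fnmatch

named_modules = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
]
pattern = "model.layers.*.self_attn.q_proj"
print([m for m in named_modules if fnmatch.fnmatch(m, pattern)])
# ['model.layers.0.self_attn.q_proj', 'model.layers.1.self_attn.q_proj']
```
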
## Feature Extraction using HuggingFace Datasets

To use VLM-Lens with either hosted or local datasets, there are multiple methods available depending on where the input images live.

First, your dataset must be standardized to a format that includes the attributes `prompt`, `label`, and `image_path`. Here is a snippet of the `compling/coco-val2017-obj-qa-categories` dataset, adjusted to these attributes:

| id | prompt | label | image_path |
|---|---|---|---|
| 397,133 | Is this A photo of a dining table on the bottom | yes | /path/to/397133.png |
| 37,777 | Is this A photo of a dining table on the top | no | /path/to/37777.png |

This can be achieved manually or by using the helper script in `scripts/map_datasets.py`.

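For the manual route, a sketch along the following lines would work with the `datasets` library; the source column names (`question`, `answer`, `file_name`) are assumptions for illustration and not the actual schema of the hosted dataset.

```python
# Hypothetical standardization of a HuggingFace dataset to prompt/label/image_path columns.
from datasets import load_dataset

ds = load_dataset("compling/coco-val2017-obj-qa-categories", split="val2017")
ds = ds.map(lambda row: {
    "prompt": row["question"],       # assumed source column
    "label": row["answer"],          # assumed source column
    "image_path": row["file_name"],  # assumed source column
})
```
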
### Method 1: Using hosted datasets

If you are using datasets hosted on a platform such as HuggingFace, you will either use images that are also *hosted*, or ones that are *downloaded locally* with an identifier to map back to the hosted dataset (e.g., a filename).

You must use the `dataset_path` attribute in your configuration file with the appropriate `dataset_split` (if it exists; otherwise leave it out).

#### 1(a): Hosted Dataset with Hosted Images
```yaml
dataset:
  - dataset_path: compling/coco-val2017-obj-qa-categories
  - dataset_split: val2017
```

#### 1(b): Hosted Dataset with Local Images

> 🚨 **NOTE**: The `image_path` attribute in the dataset must contain either filenames or relative paths, such that a cell value of `train/00023.png` can be joined with `image_dataset_path` to form the full absolute path: `/path/to/local/images/train/00023.png`. If the `image_path` attribute does not require any additional path joining, you can leave out the `image_dataset_path` attribute.

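To make the joining rule concrete, this is the path composition the note describes (plain Python, values taken from the note's own example):

```python
import os

image_dataset_path = "/path/to/local/images"  # from the config
image_path = "train/00023.png"                # cell value in the dataset
print(os.path.join(image_dataset_path, image_path))
# /path/to/local/images/train/00023.png
```
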
```yaml
dataset:
  - dataset_path: compling/coco-val2017-obj-qa-categories
  - dataset_split: val2017
  - image_dataset_path: /path/to/local/images  # downloaded using configs/dataset/download-coco.yaml
```

### Method 2: Using local datasets

#### 2(a): Local Dataset containing Image Files
```yaml
dataset:
  - local_dataset_path: /path/to/local/CLEVR
  - dataset_split: train  # leave out if unspecified
```

#### 2(b): Local Dataset with Separate Input Image Directory

> 🚨 **NOTE**: The `image_path` attribute in the dataset must contain either filenames or relative paths, such that a cell value of `train/00023.png` can be joined with `image_dataset_path` to form the full absolute path: `/path/to/local/images/train/00023.png`. If the `image_path` attribute does not require any additional path joining, you can leave out the `image_dataset_path` attribute.

```yaml
dataset:
  - local_dataset_path: /path/to/local/CLEVR
  - dataset_split: train  # leave out if unspecified
  - image_dataset_path: /path/to/local/CLEVR/images
```

### Output Database

Specified by the `-o` or `--output-db` flag, this is the output database file to write embeddings to. The database contains a single SQL table named `tensors` with the following columns:
```
name, architecture, timestamp, image_path, prompt, label, layer, pooling_method, tensor_dim, tensor
```
where each column contains:
1. `name` is the model path from HuggingFace.
2. `architecture` is the architecture flag, one of the supported options above.
3. `timestamp` is the specific time that the model was run.
4. `image_path` is the absolute path to the image.
5. `prompt` stores the prompt used in that instance.
6. `label` is an optional cell that stores the "ground-truth" answer, which is helpful in use cases such as classification.
7. `layer` is the matched layer from `model.named_modules()`.
8. `pooling_method` is the pooling method used for aggregating token embeddings.
9. `tensor_dim` is the dimension of the saved tensor.
10. `tensor` is the saved embedding.

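To peek at what was written, a minimal sketch like the one below can query the table with Python's built-in `sqlite3`; how the `tensor` blob should be deserialized (the `float32` assumption here) depends on VLM-Lens's actual serialization, so treat that line as illustrative.

```python
# Inspect a few rows of the output database produced by the Qwen example above.
import sqlite3
import numpy as np

conn = sqlite3.connect("output/qwen.db")
rows = conn.execute(
    "SELECT name, layer, tensor_dim, tensor FROM tensors LIMIT 5"
).fetchall()
for name, layer, tensor_dim, blob in rows:
    arr = np.frombuffer(blob, dtype=np.float32)  # dtype/encoding is an assumption
    print(name, layer, tensor_dim, arr.shape)
conn.close()
```
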
## Principal Component Analysis over Primitive Concept

### Data Collection

Download license-free images for primitive concepts (e.g., colors):
```bash
pip install -r data/concepts/requirements.txt
python -m data.concepts.download --config configs/concepts/colors.yaml
```

### Embedding Extraction

Run the LLaVA model to obtain embeddings of the concept images:
```bash
python -m src.main --config configs/models/llava-7b/llava-7b-concepts-colors.yaml --device cuda
```

Also, run the LLaVA model to obtain embeddings of the test images:
```bash
python -m src.main --config configs/models/llava-7b/llava-7b.yaml --device cuda
```

### Run PCA

Several PCA-based analysis scripts are provided:
```bash
pip install -r src/concepts/requirements.txt
python -m src.concepts.pca
python -m src.concepts.pca_knn
python -m src.concepts.pca_separation
```

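For intuition about what these scripts compute, the following is an illustrative scikit-learn sketch of PCA over a matrix of embeddings (random placeholder data here, not the repository's `src.concepts.pca` implementation):

```python
# Project embeddings onto their top principal components and report explained variance.
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.randn(200, 4096)  # placeholder for embeddings loaded from the output DB
pca = PCA(n_components=2)
coords = pca.fit_transform(embeddings)   # (200, 2) coordinates, e.g. for a scatter plot
print(pca.explained_variance_ratio_)     # fraction of variance captured by each component
```
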
## Contributing to VLM-Lens

We welcome contributions to VLM-Lens! If you have suggestions, improvements, or bug fixes, please consider submitting a pull request; we actively review them.

We generally follow the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html) to ensure readability, with a few exceptions stated in `.flake8`.
We use pre-commit hooks to ensure code quality and consistency. Please make sure to run the following before committing:
```bash
pip install pre-commit
pre-commit install
```

## Miscellaneous

### Using a Cache

To use a specific cache, set the `HF_HOME` environment variable, like so:
```bash
HF_HOME=./cache/ python -m src.main --config configs/models/clip/clip.yaml --debug
```

### Using Submodule-Based Models

Some models require separate submodules to be cloned, such as Glamm.
To use these models, please follow the instructions below to download the submodules.

#### Glamm
For Glamm (GroundingLMM), one needs to clone the separate submodules, which can be done with the following command:
```bash
git submodule update --recursive --init
```

See [our documentation](https://compling-wat.github.io/vlm-lens/tutorials/grounding-lmm.html) for details on the installation.

**Added (new README content):**

---
title: VLM-Lens
emoji: 👁️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "4.0.0"
app_file: app.py
pinned: false
---

# VLM-Lens 👁️🔍

A visual lens into the internals of Vision-Language Models.
Built with Gradio, this demo lets you explore token-level probabilities, spatial grounding, and interpretability visualizations.

> Developed by [@marstin](https://huggingface.co/marstin)