<p align="center">
  <h1 align="center">Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction</h1>
  <p align="center">
    <a href="https://lyhisme.github.io/"><strong>Yunheng Li</strong></a>
    ·
    <a href="https://zcablii.github.io/"><strong>Yuxuan Li</strong></a>
    ·
    <a href="https://github.com/ashun989"><strong>Quansheng Zeng</strong></a>
    ·
    <a href="https://whai362.github.io/"><strong>Wenhai Wang</strong></a>
    ·
    <a href="https://houqb.github.io/"><strong>Qibin Hou†</strong></a>
    ·
    <a href="https://mmcheng.net/cmm/"><strong>Ming-Ming Cheng</strong></a>
  </p>
</p>
<h2 align="center">Accepted by ICCV 2025!</h2>

### [[Paper](https://arxiv.org/pdf/2412.06244)] [[GitHub](https://github.com/HVision-NKU/DenseVLM)] [[Pretrained Models](https://huggingface.co/lyhisme/DenseVLM)]
## Contributions

- 🔥 We identify the foreground bias issue in existing VLMs and propose region-text alignment that incorporates explicit semantic structuring through category guidance.
- 🔥 We propose DenseVLM, a region-language alignment framework that uses a strong VLM to retrieve categories for unlabeled regions and decouples foreground and background features to reduce bias.
- 🔥 Extensive experiments on dense prediction benchmarks show that DenseVLM outperforms previous methods and exhibits promising scalability.
<p align="center">
  <img src="assets/Foreground_bias.png" alt="Problem analysis of foreground bias." height="180" style="display: inline; margin: 0 5px;">
  <img src="assets/Foreground_bias_2.png" alt="Comparison of different VLMs." height="180" style="display: inline; margin: 0 5px;">
</p>

<p align="center">
  <img src="assets/DenseVLM_Comparison.png" style="display: inline">
</p>
## Overview

DenseVLM is an annotation-free fine-tuning framework for open-vocabulary dense prediction. It retrieves region-level semantics from a powerful vision-language model and decouples foreground and background features, yielding unbiased region-language alignment and improved open-vocabulary dense prediction.
<p align="center">
  <img src="assets/DenseVLM_Overview.png" style="display: inline">
</p>

<p align="center">
  <img src="assets/DenseVLM_Performance.png" alt="Performance comparison." height="170" style="display: inline; margin: 0 5px;">
  <img src="assets/DenseVLM_Visualizations.png" alt="Qualitative visualizations." height="170" style="display: inline; margin: 0 5px;">
</p>
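For readers who prefer code, the sketch below is a minimal, illustrative rendition of the unbiased region-language alignment idea described above. It is not the repository's training objective; the function name, tensor shapes, temperature, and exact loss form are assumptions.

```python
# Illustrative sketch (NOT the actual training code): region features pooled
# from the student CLIP are aligned to category text embeddings, with the
# target category of each unlabeled region retrieved by a frozen, stronger VLM.
# Foreground and background regions are averaged separately so that the more
# numerous foreground regions do not dominate the alignment.
import torch
import torch.nn.functional as F


def region_alignment_loss(region_feats, text_embeds, retrieved_ids, is_foreground, tau=0.07):
    """region_feats:  (N, D) region features pooled from the student CLIP.
    text_embeds:    (C, D) category text embeddings.
    retrieved_ids:  (N,)   category indices retrieved by the frozen VLM.
    is_foreground:  (N,)   bool mask separating foreground from background regions.
    """
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = region_feats @ text_embeds.T / tau  # (N, C) region-to-category similarities

    # Decoupled alignment: compute the loss over foreground and background
    # regions separately, then sum, so each group contributes equally.
    loss_fg = F.cross_entropy(logits[is_foreground], retrieved_ids[is_foreground])
    loss_bg = F.cross_entropy(logits[~is_foreground], retrieved_ids[~is_foreground])
    return loss_fg + loss_bg
```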
## TODO

- [x] Release the training and inference code of DenseVLM.
- [x] Support training and inference for RegionCLIP and CLIPSelf.
- [ ] Release the code to integrate DenseVLM into CAT-Seg.
- [ ] Release the code to integrate DenseVLM into F-ViT.
## Quick Start

- 🚀 Linux system with CUDA 11.8
- 🚀 At least one RTX 3090 GPU (training defaults to 4 GPUs, at roughly 23 min/epoch)
#### 1. Create Conda Environment

- The environment below is recommended for reproducing our results; similar configurations may also work.

```bash
git clone git@github.com:HVision-NKU/DenseVLM.git
cd DenseVLM
conda create -n DenseVLM python=3.8.20
conda activate DenseVLM
pip install -r requirements.txt
pip install -e . -v
```
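After the editable install, a quick sanity check can confirm that PyTorch sees your GPUs and that the `open_clip` package is importable. The snippet below is illustrative and only uses the package already referenced in the loading example further down.

```python
# Quick environment check: PyTorch version, CUDA visibility, and open_clip install.
import torch
import open_clip

print('torch:', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('visible GPUs:', torch.cuda.device_count())
print('open_clip imported from:', open_clip.__file__)
```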
#### 2. Data Preparation

The main experiments are conducted on the [COCO](https://cocodataset.org/#home) and [ADE20K](http://sceneparsing.csail.mit.edu) datasets. Please prepare the datasets and organize them as follows:
```text
DenseVLM/
├── data
    ├── coco
        ├── annotations
            ├── instances_train2017.json
            ├── panoptic_val2017.json
            ├── panoptic_val2017/
        ├── train2017/
        ├── val2017/
        ├── coco_pseudo_4764.json
        ├── coco_proposals.json
    ├── ADEChallengeData2016
        ├── ade20k_panoptic_val/
        ├── images/validation/
        ├── ade20k_panoptic_val.json
```
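Optionally, you can verify that the core files are in place before training. The paths below mirror the layout above; only a representative subset is checked, and the exact set of required files depends on the scripts you run.

```python
# Check that the expected COCO and ADE20K files/folders exist under data/.
from pathlib import Path

expected = [
    'data/coco/annotations/instances_train2017.json',
    'data/coco/annotations/panoptic_val2017.json',
    'data/coco/train2017',
    'data/coco/val2017',
    'data/ADEChallengeData2016/ade20k_panoptic_val.json',
]
for p in expected:
    status = 'OK     ' if Path(p).exists() else 'MISSING'
    print(f'{status} {p}')
```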
#### 3. Checkpoints

Please download the pretrained weights from [Hugging Face](https://huggingface.co/lyhisme/DenseVLM) and organize them as follows:

```text
DenseVLM/
├── checkpoints
    ├── EVA02_CLIP_B_psz16_s8B.pt
    ├── clipself_coco_6_save6_512_eva_vitl14_24layers.pt
    ├── densevlm_coco_6_save6_512_eva_vib16_12layers.pt
```
If you are using a fine-tuned CLIP checkpoint, you can load it directly. For example:

```python
import open_clip

# Load the DenseVLM fine-tuned EVA02-CLIP-B-16 weights from a local checkpoint.
model = open_clip.create_model(
    'EVA02-CLIP-B-16', pretrained='eva',
    cache_dir='checkpoints/densevlm_coco_6_save6_512_eva_vib16_12layers.pt'
)
```
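As a usage sketch, the loaded model can then be queried through the standard open_clip API; the tokenizer call, dummy input resolution, and prompts below are illustrative assumptions rather than part of the repository's scripts.

```python
import torch
import torch.nn.functional as F
import open_clip

model = open_clip.create_model(
    'EVA02-CLIP-B-16', pretrained='eva',
    cache_dir='checkpoints/densevlm_coco_6_save6_512_eva_vib16_12layers.pt'
)
model.eval()
tokenizer = open_clip.get_tokenizer('EVA02-CLIP-B-16')

image = torch.randn(1, 3, 224, 224)  # dummy input; use a preprocessed image in practice
texts = tokenizer(['a photo of a cat', 'a photo of grass'])

with torch.no_grad():
    img_feat = F.normalize(model.encode_image(image), dim=-1)
    txt_feat = F.normalize(model.encode_text(texts), dim=-1)
    similarity = img_feat @ txt_feat.T  # (1, 2) image-text cosine similarities
print(similarity)
```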
#### 4. Training and Testing

To fine-tune the CLIP model with DenseVLM, run:

```bash
bash scripts/train_densevlm_coco_image_patches_eva_vitb16.sh
```

To evaluate the CLIP model fine-tuned with DenseVLM on COCO and ADE20K, run:

```bash
bash scripts/test_coco_eva_vitb16_macc_boxes_masks.sh path/to/checkpoint.pt 2 densevlm_coco_test
bash scripts/test_ade_eva_vitb16_macc_boxes_masks.sh path/to/checkpoint.pt 2 densevlm_ade_test
```
## 🙏 Citation

If you find this project useful, please consider citing:

```bibtex
@article{li2024densevlm,
  title={Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction},
  author={Li, Yunheng and Li, Yuxuan and Zeng, Quansheng and Wang, Wenhai and Hou, Qibin and Cheng, Ming-Ming},
  journal={arXiv preprint arXiv:2412.06244},
  year={2024}
}

@InProceedings{li2024cascadeclip,
  title={Cascade-{CLIP}: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation},
  author={Li, Yunheng and Li, Zhong-Yu and Zeng, Quan-Sheng and Hou, Qibin and Cheng, Ming-Ming},
  booktitle={Proceedings of the 41st International Conference on Machine Learning},
  pages={28243--28258},
  year={2024},
  volume={235},
  month={21--27 Jul},
  publisher={PMLR}
}
```
## License

This project is released under a [Creative Commons Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/) license for non-commercial use only. Any commercial use requires formal permission in advance.