	---
	license: cc-by-nc-4.0
	library_name: transformers
	pipeline_tag: image-segmentation
	datasets:
	- WhynotHug/DRSeg
	tags:
	- pixdlm
	- cvpr-2026
	- compute-transparency
	- reasoning-segmentation
	- uav
	- remote-sensing
	- vision-language
	- image-segmentation
	---

	# 【CVPR2026 Highlight】PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

	[![Paper](https://img.shields.io/badge/arXiv-2604.15670-b31b1b)](https://arxiv.org/abs/2604.15670)
	[![Model](https://img.shields.io/badge/HuggingFace-WhynotHug%2FPixDLM-yellow)](https://huggingface.co/WhynotHug/PixDLM)
	[![Dataset](https://img.shields.io/badge/HuggingFace-WhynotHug%2FDRSeg-green)](https://huggingface.co/datasets/WhynotHug/DRSeg)
	[![Project Page](https://img.shields.io/badge/Project-PixDLM-blue)](https://huggingface.co/spaces/WhynotHug/PixDLM)


	## Highlights

	- UAV reasoning segmentation: instruction following, visual reasoning, and
	pixel-level segmentation in high-resolution aerial scenes.
	- DRSeg benchmark: 10,000 UAV images with instance masks and reasoning QA pairs.
	- Dual-path visual design: a language-aligned semantic path and a pixel-level
	perception path for small-object UAV scenes.
	- Reproducible scripts for setup, evaluation, and training.

	## Repository Layout

	```text
	PixDLM/
	├── model/ # PixDLM, LLaVA, and segmentation modules
	├── utils/ # DRSeg and referring-segmentation data utilities
	├── configs/ # CLIP preprocessing configs
	├── scripts/ # Setup, evaluation, training, release helpers
	├── docs/ # Reproducibility and data/model documentation
	├── examples/ # Minimal local examples and expected file layout
	├── pretrained/pixdlm-7b/ # Lightweight model config/tokenizer files
	├── release/huggingface/ # Model card, dataset card, Space, upload helpers
	├── eval.py # Evaluation entry point
	├── train_ds.py # Training entry point
	└── requirements.txt
	```

	Large files are intentionally not committed to this source tree. Download them
	with the commands below.

	## Quick Start

	### 1. Create Environment

	```bash
	git clone https://huggingface.co/WhynotHug/PixDLM
	cd PixDLM

	conda create -n pixdlm python=3.10 -y
	conda activate pixdlm
	pip install -r requirements.txt
	pip install flash-attn --no-build-isolation
	```

	### 2. Download Model and Data

	```bash
	python scripts/download_assets.py \
	  --model-repo WhynotHug/PixDLM \
	  --dataset-repo WhynotHug/DRSeg \
	  --output-dir .
	```

	Expected local layout:

	```text
	PixDLM/
	├── pretrained/
	│   └── pixdlm-7b/                 # PixDLM HF model snapshot
	├── checkpoints/
	│   ├── clip-vit-large-patch14/    # CLIP vision tower
	│   └── sam2_checkpoints/          # Optional SAM/SAM2 checkpoints
	└── data/
	    └── DRSeg/
	        ├── DRtrain/
	        ├── DRval/
	        ├── DRtest/
	        └── label/
	```

	If your DRSeg archive uses `label/` instead of `labels/`, run:

	```bash
	python scripts/prepare_drseg.py --data-root data/DRSeg
	```
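
Before launching evaluation, it can help to confirm that the download produced the layout shown above. The following is a minimal sketch, not part of the repository: the function name `missing_assets` and the constant names are hypothetical, the optional `sam2_checkpoints/` directory is deliberately not checked, and either `label/` or `labels/` is accepted for the annotation folder.

```python
from pathlib import Path

# Directories taken from the expected layout above, relative to the repo root.
# The optional SAM/SAM2 checkpoint directory is intentionally left out.
REQUIRED_DIRS = [
    "pretrained/pixdlm-7b",
    "checkpoints/clip-vit-large-patch14",
    "data/DRSeg/DRtrain",
    "data/DRSeg/DRval",
    "data/DRSeg/DRtest",
]
# The annotation folder may be named either way, so accept both spellings.
LABEL_DIR_CANDIDATES = ["data/DRSeg/label", "data/DRSeg/labels"]


def missing_assets(root="."):
    """Return the expected directories that do not exist under `root`."""
    root = Path(root)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    if not any((root / d).is_dir() for d in LABEL_DIR_CANDIDATES):
        missing.append("data/DRSeg/label(s)")
    return missing


if __name__ == "__main__":
    problems = missing_assets(".")
    if problems:
        print("Missing directories:")
        for p in problems:
            print("  -", p)
    else:
        print("All expected directories found.")
```

Run it from the repository root; an empty result means every required directory is in place.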

	### 3. Run Evaluation

	Single GPU:

	```bash
	bash scripts/eval_drseg.sh \
	  --gpus 0 \
	  --model pretrained/pixdlm-7b \
	  --data data/DRSeg \
	  --clip checkpoints/clip-vit-large-patch14
	```

	Multi-GPU:

	```bash
	bash scripts/eval_drseg.sh \
	  --gpus 0,1,2,3,4,5,6,7 \
	  --model pretrained/pixdlm-7b \
	  --data data/DRSeg \
	  --clip checkpoints/clip-vit-large-patch14 \
	  --exp pixdlm_drseg_test
	```

	Important: PyTorch `DistributedSampler(drop_last=False)` pads the test set by
	repeating samples when the sample count is not divisible by the number of
	ranks, so the repeated samples can be counted twice in the metrics. For exact
	paper-table reproduction on the DRSeg test split, run on a single GPU or use a
	no-padding sampler.
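
To make the padding effect concrete, here is a minimal pure-Python sketch of the index arithmetic. It assumes the unshuffled behavior of `DistributedSampler` (pad the index list with its leading entries, then stride by rank) and contrasts it with a strided shard that does no padding. The function names `padded_shard` and `unpadded_shard` are hypothetical and are not the sampler used by this repository.

```python
import math


def padded_shard(num_samples, rank, world_size):
    """Indices one rank receives when the index list is padded to be evenly
    divisible (mirrors DistributedSampler with shuffle=False, drop_last=False)."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]  # repeat leading samples as padding
    return indices[rank:total:world_size]


def unpadded_shard(num_samples, rank, world_size):
    """Strided shard without padding: every sample appears exactly once overall."""
    return list(range(rank, num_samples, world_size))


if __name__ == "__main__":
    n, world = 5001, 8  # DRSeg test size from this README, 8 GPUs
    padded = [i for r in range(world) for i in padded_shard(n, r, world)]
    exact = [i for r in range(world) for i in unpadded_shard(n, r, world)]
    print(len(padded), len(padded) - len(set(padded)))  # 5008 7
    print(len(exact), len(set(exact)))  # 5001 5001
```

With 5,001 test samples on 8 ranks, padding yields 5,008 evaluated items, 7 of them duplicates. With the unpadded shard, ranks receive unequal counts, so any collective that assumes equal batch counts per rank would need to handle the shorter ranks.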

	### 4. Run Training

	```bash
	bash scripts/train_drseg.sh \
	  --gpus 0,1,2,3,4,5,6,7 \
	  --base-model checkpoints/llava-v1.6-vicuna-7b \
	  --data data/DRSeg \
	  --clip checkpoints/clip-vit-large-patch14 \
	  --exp pixdlm_train
	```

	## Input and Output

	PixDLM takes one UAV image and one reasoning-oriented question as input. The
	question identifies a target through spatial, visual-attribute, or scene-context
	reasoning, for example "Which vehicle is closest to the intersection and may
	affect traffic flow?"

	The model outputs a textual answer and a target segmentation mask. Evaluation
	also writes visualizations and metrics to:

	```text
	outputs/<exp_name>/<dataset_name>/with_cot/
	logs/<exp_name>/
	```

	Typical per-sample artifacts include the input image, predicted mask, ground
	truth mask, red/green overlay, and a JSON result file.
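
The paths above can be assembled programmatically when collecting results. This is a small sketch under stated assumptions: the helpers `result_dirs` and `list_json_results` are hypothetical, the JSON result files are assumed to use a `.json` extension, and nothing is assumed about their contents.

```python
from pathlib import Path


def result_dirs(exp_name, dataset_name, root="."):
    """Build the output and log directories named in this README."""
    root = Path(root)
    out_dir = root / "outputs" / exp_name / dataset_name / "with_cot"
    log_dir = root / "logs" / exp_name
    return out_dir, log_dir


def list_json_results(exp_name, dataset_name, root="."):
    """Return sorted JSON files under the output directory (empty if none)."""
    out_dir, _ = result_dirs(exp_name, dataset_name, root)
    if not out_dir.is_dir():
        return []
    return sorted(out_dir.rglob("*.json"))
```

For example, `list_json_results("pixdlm_drseg_test", "<dataset_name>")` would list result files for the multi-GPU run above, with `<dataset_name>` replaced by whatever name the evaluation script writes.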

	## Dataset

	DRSeg is available at:

	- https://huggingface.co/datasets/WhynotHug/DRSeg

	It contains 10,000 UAV images with instance masks and reasoning annotations:

	- Train: 2,999 samples
	- Validation: 2,000 samples
	- Test: 5,001 samples

	Reasoning types are balanced across spatial, attribute, and scene-level
	reasoning.
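
As a quick sanity check on the split sizes listed above, the following sketch sums them and computes each split's share of the dataset. The variable names are illustrative only.

```python
# Split sizes as listed in this README.
splits = {"train": 2999, "val": 2000, "test": 5001}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

if __name__ == "__main__":
    print(total)
    for name, frac in fractions.items():
        print(f"{name}: {frac:.2%}")
```

The test split accounts for roughly half of the 10,000 samples, which is why the padding note in the evaluation section matters for exact reproduction.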

	## Citation

	```bibtex
	@inproceedings{ke2026pixdlm,
	title={PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation},
	author={Ke, Shuyan and Mei, Yifan and Wu, Changli and Zheng, Yonghan and Ji, Jiayi and Cao, Liujuan and Ji, Rongrong},
	booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
	year={2026}
	}
	```

	## Acknowledgements

	This project builds on LLaVA, CLIP, SAM/SAM2, and the broader reasoning
	segmentation ecosystem. Please follow the licenses of all upstream components.