---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: image-segmentation
datasets:
- WhynotHug/DRSeg
tags:
- pixdlm
- cvpr-2026
- compute-transparency
- reasoning-segmentation
- uav
- remote-sensing
- vision-language
- image-segmentation
---
# 【CVPR2026 Highlight】PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
[![Paper](https://img.shields.io/badge/arXiv-2604.15670-b31b1b)](https://arxiv.org/abs/2604.15670)
[![Model](https://img.shields.io/badge/HuggingFace-WhynotHug%2FPixDLM-yellow)](https://huggingface.co/WhynotHug/PixDLM)
[![Dataset](https://img.shields.io/badge/HuggingFace-WhynotHug%2FDRSeg-green)](https://huggingface.co/datasets/WhynotHug/DRSeg)
[![Project Page](https://img.shields.io/badge/Project-PixDLM-blue)](https://huggingface.co/spaces/WhynotHug/PixDLM)
## Highlights
- UAV reasoning segmentation: instruction following, visual reasoning, and
pixel-level segmentation in high-resolution aerial scenes.
- DRSeg benchmark: 10,000 UAV images with instance masks and reasoning QA pairs.
- Dual-path visual design: a language-aligned semantic path and a pixel-level
perception path for small-object UAV scenes.
- Reproducible scripts for setup, evaluation, and training.
## Repository Layout
```text
PixDLM/
├── model/                  # PixDLM, LLaVA, and segmentation modules
├── utils/                  # DRSeg and referring-segmentation data utilities
├── configs/                # CLIP preprocessing configs
├── scripts/                # Setup, evaluation, training, release helpers
├── docs/                   # Reproducibility and data/model documentation
├── examples/               # Minimal local examples and expected file layout
├── pretrained/pixdlm-7b/   # Lightweight model config/tokenizer files
├── release/huggingface/    # Model card, dataset card, Space, upload helpers
├── eval.py                 # Evaluation entry point
├── train_ds.py             # Training entry point
└── requirements.txt
```
Large files are intentionally not committed to this source tree. Download them
with the commands below.
## Quick Start
### 1. Create Environment
```bash
git clone https://huggingface.co/WhynotHug/PixDLM
cd PixDLM
conda create -n pixdlm python=3.10 -y
conda activate pixdlm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
### 2. Download Model and Data
```bash
python scripts/download_assets.py \
--model-repo WhynotHug/PixDLM \
--dataset-repo WhynotHug/DRSeg \
--output-dir .
```
Expected local layout:
```text
PixDLM/
├── pretrained/
│   └── pixdlm-7b/                 # PixDLM HF model snapshot
├── checkpoints/
│   ├── clip-vit-large-patch14/    # CLIP vision tower
│   └── sam2_checkpoints/          # Optional SAM/SAM2 checkpoints
└── data/
    └── DRSeg/
        ├── DRtrain/
        ├── DRval/
        ├── DRtest/
        └── label/
If your DRSeg archive uses `label/` instead of `labels/`, run:
```bash
python scripts/prepare_drseg.py --data-root data/DRSeg
```
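Before running evaluation, it can help to confirm that the expected layout is in place. The snippet below is a minimal sketch, not part of the repository: `missing_dirs` is a hypothetical helper, the directory list is copied from the layout shown above, and accepting either `label/` or `labels/` is an assumption based on both names appearing in this README. The optional SAM/SAM2 checkpoint directory is not checked.

```python
from pathlib import Path

# Directories taken from the expected layout in this README.
REQUIRED_DIRS = [
    "pretrained/pixdlm-7b",
    "checkpoints/clip-vit-large-patch14",
    "data/DRSeg/DRtrain",
    "data/DRSeg/DRval",
    "data/DRSeg/DRtest",
]
# Assumption: either spelling of the label directory counts as present.
LABEL_DIR_CANDIDATES = ["data/DRSeg/label", "data/DRSeg/labels"]


def missing_dirs(root="."):
    """Return the expected directories that are absent under `root`."""
    root = Path(root)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    if not any((root / d).is_dir() for d in LABEL_DIR_CANDIDATES):
        missing.append(" or ".join(LABEL_DIR_CANDIDATES))
    return missing


if __name__ == "__main__":
    # Run from the repository root; prints nothing when the layout is complete.
    for entry in missing_dirs("."):
        print(f"missing: {entry}")
```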
### 3. Run Evaluation
Single GPU:
```bash
bash scripts/eval_drseg.sh \
--gpus 0 \
--model pretrained/pixdlm-7b \
--data data/DRSeg \
--clip checkpoints/clip-vit-large-patch14
```
Multi-GPU:
```bash
bash scripts/eval_drseg.sh \
--gpus 0,1,2,3,4,5,6,7 \
--model pretrained/pixdlm-7b \
--data data/DRSeg \
--clip checkpoints/clip-vit-large-patch14 \
--exp pixdlm_drseg_test
```
Important: with `drop_last=False`, PyTorch's `DistributedSampler` pads the test
set by repeating samples when the sample count is not divisible by the number of
ranks, so duplicated samples can be counted twice in the metrics. For exact
paper-table reproduction on the DRSeg test split, run on a single GPU or use a
no-padding sampler.
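To make the padding concrete, here is a small pure-Python sketch of the arithmetic. It mirrors how the sampler rounds each rank's share up to an equal size, but it is an illustration rather than the repository's code: `padded_total` and `shard_without_padding` are hypothetical names, and the strided split is only one possible way to build a no-padding sampler.

```python
import math


def padded_total(num_samples, world_size):
    """Total indices visited when every rank receives an equal, rounded-up share."""
    per_rank = math.ceil(num_samples / world_size)
    return per_rank * world_size


def shard_without_padding(num_samples, rank, world_size):
    """Strided shard: each index is used exactly once across all ranks."""
    return list(range(rank, num_samples, world_size))


if __name__ == "__main__":
    n_test = 5001  # DRSeg test split size stated in this README
    for world_size in (1, 8):
        extra = padded_total(n_test, world_size) - n_test
        print(f"{world_size} rank(s): {extra} duplicated sample(s)")
    shards = [shard_without_padding(n_test, r, 8) for r in range(8)]
    print(sum(len(s) for s in shards))  # every sample covered exactly once
```

With 8 ranks, the 5,001 test samples are rounded up to 5,008 indices, so 7 samples are evaluated twice. The strided split avoids this, but ranks then hold shards of unequal length, so any metric gathering has to tolerate that.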
### 4. Run Training
```bash
bash scripts/train_drseg.sh \
--gpus 0,1,2,3,4,5,6,7 \
--base-model checkpoints/llava-v1.6-vicuna-7b \
--data data/DRSeg \
--clip checkpoints/clip-vit-large-patch14 \
--exp pixdlm_train
```
## Input and Output
PixDLM takes one UAV image and one reasoning-oriented question as input. The
question identifies a target through spatial, visual-attribute, or scene-context
reasoning, for example "Which vehicle is closest to the intersection and may
affect traffic flow?"
The model outputs a textual answer and a target segmentation mask. Evaluation
also writes visualizations and metrics to:
```text
outputs/<exp_name>/<dataset_name>/with_cot/
logs/<exp_name>/
```
Typical per-sample artifacts include the input image, predicted mask, ground
truth mask, red/green overlay, and a JSON result file.
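The output locations above can be resolved programmatically when collecting results after a run. The following is a rough sketch under stated assumptions: `eval_output_dirs` and `list_json_results` are hypothetical helpers, the JSON files are assumed to sit somewhere below the `with_cot/` directory, and nothing is assumed about their schema.

```python
from pathlib import Path


def eval_output_dirs(exp_name, dataset_name, root="."):
    """Build the two evaluation output locations listed above."""
    root = Path(root)
    return {
        "outputs": root / "outputs" / exp_name / dataset_name / "with_cot",
        "logs": root / "logs" / exp_name,
    }


def list_json_results(exp_name, dataset_name, root="."):
    """Recursively collect JSON files under the outputs directory, if it exists."""
    out_dir = eval_output_dirs(exp_name, dataset_name, root)["outputs"]
    return sorted(out_dir.rglob("*.json")) if out_dir.is_dir() else []
```

For example, `list_json_results("pixdlm_drseg_test", "DRSeg")` would list result files for the multi-GPU run shown earlier, assuming the dataset directory is named `DRSeg`; check the actual name written by the evaluation script.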
## Dataset
DRSeg is available at:
- https://huggingface.co/datasets/WhynotHug/DRSeg
It contains 10,000 UAV images with instance masks and reasoning annotations:
- Train: 2,999 samples
- Validation: 2,000 samples
- Test: 5,001 samples
Reasoning types are balanced across spatial, attribute, and scene-level
reasoning.
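As a quick consistency check on the split sizes listed above, the sketch below sums them and computes each split's share of the dataset. The counts are copied from this README; the variable names are illustrative only.

```python
# Split sizes as stated in this README.
SPLITS = {"train": 2999, "validation": 2000, "test": 5001}

total = sum(SPLITS.values())
fractions = {name: count / total for name, count in SPLITS.items()}

if __name__ == "__main__":
    print(total)  # 10000
    for name, frac in fractions.items():
        print(f"{name}: {frac:.2%}")
```

The splits add up to the stated 10,000 images, with roughly 30% for training, 20% for validation, and 50% for testing.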
## Citation
```bibtex
@inproceedings{ke2026pixdlm,
title={PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation},
author={Ke, Shuyan and Mei, Yifan and Wu, Changli and Zheng, Yonghan and Ji, Jiayi and Cao, Liujuan and Ji, Rongrong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2026}
}
```
## Acknowledgements
This project builds on LLaVA, CLIP, SAM/SAM2, and the broader reasoning
segmentation ecosystem. Please follow the licenses of all upstream components.