[CVPR 2026 Highlight] PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

Highlights

UAV reasoning segmentation: instruction following, visual reasoning, and pixel-level segmentation in high-resolution aerial scenes.
DRSeg benchmark: 10,000 UAV images with instance masks and reasoning QA pairs.
Dual-path visual design: a language-aligned semantic path and a pixel-level perception path for small-object UAV scenes.
Reproducible scripts for setup, evaluation, and training.

Repository Layout

PixDLM/
├── model/                    # PixDLM, LLaVA, and segmentation modules
├── utils/                    # DRSeg and referring-segmentation data utilities
├── configs/                  # CLIP preprocessing configs
├── scripts/                  # Setup, evaluation, training, release helpers
├── docs/                     # Reproducibility and data/model documentation
├── examples/                 # Minimal local examples and expected file layout
├── pretrained/pixdlm-7b/     # Lightweight model config/tokenizer files
├── release/huggingface/      # Model card, dataset card, Space, upload helpers
├── eval.py                   # Evaluation entry point
├── train_ds.py               # Training entry point
└── requirements.txt

Large files are intentionally not committed to this source tree. Download them with the commands below.

Quick Start

1. Create Environment

git clone https://huggingface.co/WhynotHug/PixDLM
cd PixDLM

conda create -n pixdlm python=3.10 -y
conda activate pixdlm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

2. Download Model and Data

python scripts/download_assets.py \
  --model-repo WhynotHug/PixDLM \
  --dataset-repo WhynotHug/DRSeg \
  --output-dir .

Expected local layout:

PixDLM/
├── pretrained/
│   └── pixdlm-7b/                 # PixDLM HF model snapshot
├── checkpoints/
│   ├── clip-vit-large-patch14/    # CLIP vision tower
│   └── sam2_checkpoints/          # Optional SAM/SAM2 checkpoints
└── data/
    └── DRSeg/
        ├── DRtrain/
        ├── DRval/
        ├── DRtest/
        └── label/
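
Before launching evaluation, it can help to confirm that the downloaded assets match the layout above. The following is a minimal sketch and not part of the repository: the helper name missing_dirs is hypothetical, the directory names are copied from the layout, and the annotation folder is accepted as either label/ or labels/ because both spellings appear in this README.

```python
from pathlib import Path

# Directory names copied from the expected layout; sam2_checkpoints is
# optional and therefore not checked.
REQUIRED_DIRS = [
    "pretrained/pixdlm-7b",
    "checkpoints/clip-vit-large-patch14",
    "data/DRSeg/DRtrain",
    "data/DRSeg/DRval",
    "data/DRSeg/DRtest",
]
# Assumption: either spelling of the annotation directory is acceptable here.
LABEL_DIR_CANDIDATES = ["data/DRSeg/label", "data/DRSeg/labels"]


def missing_dirs(root="."):
    """Return the expected directories that are absent under root."""
    root = Path(root)
    missing = [rel for rel in REQUIRED_DIRS if not (root / rel).is_dir()]
    if not any((root / rel).is_dir() for rel in LABEL_DIR_CANDIDATES):
        missing.append(" or ".join(LABEL_DIR_CANDIDATES))
    return missing


if __name__ == "__main__":
    problems = missing_dirs(".")
    print("layout OK" if not problems else "missing: " + ", ".join(problems))
```

Run it from the PixDLM/ root after the download step; an empty result means every directory the evaluation commands reference is present.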

If your DRSeg archive uses label/ instead of labels/, run:

python scripts/prepare_drseg.py --data-root data/DRSeg

3. Run Evaluation

Single GPU:

bash scripts/eval_drseg.sh \
  --gpus 0 \
  --model pretrained/pixdlm-7b \
  --data data/DRSeg \
  --clip checkpoints/clip-vit-large-patch14

Multi-GPU:

bash scripts/eval_drseg.sh \
  --gpus 0,1,2,3,4,5,6,7 \
  --model pretrained/pixdlm-7b \
  --data data/DRSeg \
  --clip checkpoints/clip-vit-large-patch14 \
  --exp pixdlm_drseg_test

Important: PyTorch DistributedSampler with drop_last=False pads the test set with repeated samples when the sample count is not divisible by the number of ranks, so those duplicates are scored more than once. For exact paper-table reproduction on the DRSeg test split, run on a single GPU or use a sampler that does not pad.
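
To make the padding concrete, the sketch below reproduces in plain Python the index assignment that DistributedSampler performs with shuffle=False and drop_last=False, and contrasts it with a strided split that does not pad. The function names are illustrative and not part of the repository; the 5,001-sample test split and the 8 ranks come from this README.

```python
import math


def padded_shards(num_samples, world_size):
    """Mimic DistributedSampler(shuffle=False, drop_last=False)."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    # Pad by wrapping around to the first samples so every rank gets
    # exactly per_rank items.
    indices += indices[: total - num_samples]
    return [indices[rank:total:world_size] for rank in range(world_size)]


def unpadded_shards(num_samples, world_size):
    """Strided split without padding; shard lengths may differ by one."""
    return [list(range(rank, num_samples, world_size))
            for rank in range(world_size)]


NUM_TEST, WORLD_SIZE = 5001, 8
padded = padded_shards(NUM_TEST, WORLD_SIZE)
unpadded = unpadded_shards(NUM_TEST, WORLD_SIZE)

duplicates = sum(len(shard) for shard in padded) - NUM_TEST
print("padded items evaluated:", sum(len(shard) for shard in padded))
print("duplicated samples:", duplicates)
print("unpadded shard sizes:", [len(shard) for shard in unpadded])
```

With 8 ranks the padded scheme evaluates 5,008 items, 7 of them repeats of the first samples, while the strided split covers each of the 5,001 samples exactly once. The ranks then hold shards of unequal length (626 versus 625), so any collective call issued per batch would need to tolerate that.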

4. Run Training

bash scripts/train_drseg.sh \
  --gpus 0,1,2,3,4,5,6,7 \
  --base-model checkpoints/llava-v1.6-vicuna-7b \
  --data data/DRSeg \
  --clip checkpoints/clip-vit-large-patch14 \
  --exp pixdlm_train

Input and Output

PixDLM takes one UAV image and one reasoning-oriented question as input. The question identifies a target through spatial, visual-attribute, or scene-context reasoning, for example "Which vehicle is closest to the intersection and may affect traffic flow?"

The model outputs a textual answer and a target segmentation mask. Evaluation also writes visualizations and metrics to:

outputs/<exp_name>/<dataset_name>/with_cot/
logs/<exp_name>/

Typical per-sample artifacts include the input image, the predicted mask, the ground-truth mask, a red/green overlay, and a JSON result file.
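
The output locations above can be resolved programmatically when collecting results from several runs. This is a small sketch under stated assumptions: the helper names are hypothetical, the dataset name "DRSeg" is a placeholder for whatever <dataset_name> the evaluation script actually uses, and no structure is assumed for the JSON files beyond their extension.

```python
from pathlib import Path


def eval_output_dirs(exp_name, dataset_name, root="."):
    """Build the output and log directories following the README pattern."""
    root = Path(root)
    return {
        "outputs": root / "outputs" / exp_name / dataset_name / "with_cot",
        "logs": root / "logs" / exp_name,
    }


def list_json_results(output_dir):
    """Return all JSON files found under output_dir, sorted by path."""
    return sorted(Path(output_dir).rglob("*.json"))


# "DRSeg" is an assumed value for <dataset_name>.
dirs = eval_output_dirs("pixdlm_drseg_test", "DRSeg")
print(dirs["outputs"].as_posix())
print(len(list_json_results(dirs["outputs"])), "JSON result files found")
```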

Dataset

DRSeg is available at:

https://huggingface.co/datasets/WhynotHug/DRSeg

It contains 10,000 UAV images with instance masks and reasoning annotations:

Train: 2,999 samples
Validation: 2,000 samples
Test: 5,001 samples

Reasoning types are balanced across spatial, attribute, and scene-level reasoning.
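
As a quick consistency check on the split sizes listed above, the snippet below sums them and prints each split's share of the total. The counts are taken from this README; nothing is read from the dataset itself.

```python
# Split sizes as stated in this README.
SPLITS = {"train": 2999, "validation": 2000, "test": 5001}

total = sum(SPLITS.values())
assert total == 10_000, "split sizes should add up to the stated 10,000 images"

for name, count in SPLITS.items():
    print(f"{name:<10} {count:>5}  {count / total:6.2%}")
```

The splits are roughly 30% train, 20% validation, and 50% test, so the test split is larger than the training split.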

Citation

@inproceedings{ke2026pixdlm,
  title={PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation},
  author={Ke, Shuyan and Mei, Yifan and Wu, Changli and Zheng, Yonghan and Ji, Jiayi and Cao, Liujuan and Ji, Rongrong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}

Acknowledgements

This project builds on LLaVA, CLIP, SAM/SAM2, and the broader reasoning segmentation ecosystem. Please follow the licenses of all upstream components.

Paper: PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation (2604.15670, published Apr 17)