---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: image-segmentation
datasets:
  - WhynotHug/DRSeg
tags:
  - pixdlm
  - cvpr-2026
  - compute-transparency
  - reasoning-segmentation
  - uav
  - remote-sensing
  - vision-language
  - image-segmentation
---

# [CVPR 2026 Highlight] PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

[![Paper](https://img.shields.io/badge/arXiv-2604.15670-b31b1b)](https://arxiv.org/abs/2604.15670)
[![Model](https://img.shields.io/badge/HuggingFace-WhynotHug%2FPixDLM-yellow)](https://huggingface.co/WhynotHug/PixDLM)
[![Dataset](https://img.shields.io/badge/HuggingFace-WhynotHug%2FDRSeg-green)](https://huggingface.co/datasets/WhynotHug/DRSeg)
[![Project Page](https://img.shields.io/badge/Project-PixDLM-blue)](https://huggingface.co/spaces/WhynotHug/PixDLM)


## Highlights

- UAV reasoning segmentation: instruction following, visual reasoning, and
  pixel-level segmentation in high-resolution aerial scenes.
- DRSeg benchmark: 10,000 UAV images with instance masks and reasoning QA pairs.
- Dual-path visual design: a language-aligned semantic path and a pixel-level
  perception path for small-object UAV scenes.
- Reproducible scripts for setup, evaluation, and training.

## Repository Layout

```text
PixDLM/
├── model/                    # PixDLM, LLaVA, and segmentation modules
├── utils/                    # DRSeg and referring-segmentation data utilities
├── configs/                  # CLIP preprocessing configs
├── scripts/                  # Setup, evaluation, training, release helpers
├── docs/                     # Reproducibility and data/model documentation
├── examples/                 # Minimal local examples and expected file layout
├── pretrained/pixdlm-7b/     # Lightweight model config/tokenizer files
├── release/huggingface/      # Model card, dataset card, Space, upload helpers
├── eval.py                   # Evaluation entry point
├── train_ds.py               # Training entry point
└── requirements.txt
```

Large files (model weights, checkpoints, and the DRSeg data) are intentionally
not committed to this source tree. Download them as described in step 2 of
Quick Start.

## Quick Start

### 1. Create Environment

```bash
git clone https://huggingface.co/WhynotHug/PixDLM
cd PixDLM

conda create -n pixdlm python=3.10 -y
conda activate pixdlm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

### 2. Download Model and Data

```bash
python scripts/download_assets.py \
  --model-repo WhynotHug/PixDLM \
  --dataset-repo WhynotHug/DRSeg \
  --output-dir .
```

Expected local layout:

```text
PixDLM/
├── pretrained/
│   └── pixdlm-7b/                 # PixDLM HF model snapshot
├── checkpoints/
│   ├── clip-vit-large-patch14/    # CLIP vision tower
│   └── sam2_checkpoints/          # Optional SAM/SAM2 checkpoints
└── data/
    └── DRSeg/
        ├── DRtrain/
        ├── DRval/
        ├── DRtest/
        └── label/
```

If the label directory in your DRSeg archive does not match the name the code
expects (`label/` vs. `labels/`), run:

```bash
python scripts/prepare_drseg.py --data-root data/DRSeg
```
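
Before launching evaluation, it can help to confirm that the download step produced the expected layout. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` is hypothetical, and the directory list is copied from the layout shown above (the optional SAM/SAM2 checkpoints are skipped).

```python
from pathlib import Path

# Directories taken from the expected local layout in this README.
# Adjust "label" if your checkout uses a different label directory name.
REQUIRED_DIRS = [
    "pretrained/pixdlm-7b",
    "checkpoints/clip-vit-large-patch14",
    "data/DRSeg/DRtrain",
    "data/DRSeg/DRval",
    "data/DRSeg/DRtest",
    "data/DRSeg/label",
]


def missing_dirs(root="."):
    """Return the required directories that do not exist under `root`."""
    root = Path(root)
    return [d for d in REQUIRED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:")
        for d in missing:
            print(f"  {d}")
    else:
        print("Layout looks complete.")
```

Run it from the `PixDLM/` root. The check only verifies that the directories exist, not that their contents are complete.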

### 3. Run Evaluation

Single GPU:

```bash
bash scripts/eval_drseg.sh \
  --gpus 0 \
  --model pretrained/pixdlm-7b \
  --data data/DRSeg \
  --clip checkpoints/clip-vit-large-patch14
```

Multi-GPU:

```bash
bash scripts/eval_drseg.sh \
  --gpus 0,1,2,3,4,5,6,7 \
  --model pretrained/pixdlm-7b \
  --data data/DRSeg \
  --clip checkpoints/clip-vit-large-patch14 \
  --exp pixdlm_drseg_test
```

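Important: PyTorch `DistributedSampler(drop_last=False)` pads the test set by
repeating samples when the sample count is not divisible by the number of
ranks, so the repeated samples can be counted twice in the metrics. The DRSeg
test split has 5,001 samples, an odd count, so every even GPU count triggers
padding. For exact paper-table reproduction on DRSeg test, run on a single GPU
or use a no-padding sampler.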
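
The effect of padding can be worked out without a GPU. The sketch below reproduces the index arithmetic of `DistributedSampler` with shuffling disabled, in plain Python, and compares it with a simple strided split that never repeats a sample. Both function names are hypothetical and not part of this repository, and the sketch assumes the dataset has at least as many samples as ranks.

```python
import math


def padded_rank_indices(num_samples, world_size, rank):
    """Indices one rank receives with padding (no shuffling).

    Follows the drop_last=False arithmetic: the index list is extended with
    samples from its start until its length is a multiple of world_size,
    then strided across ranks.
    """
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]  # repeated samples
    return indices[rank:total:world_size]


def unpadded_rank_indices(num_samples, world_size, rank):
    """No-padding alternative: ranks may get unequal counts, no duplicates."""
    return list(range(rank, num_samples, world_size))


if __name__ == "__main__":
    n, world = 5001, 8  # DRSeg test size, 8 GPUs
    padded = [i for r in range(world) for i in padded_rank_indices(n, world, r)]
    plain = [i for r in range(world) for i in unpadded_rank_indices(n, world, r)]
    print(len(padded), len(padded) - len(set(padded)))  # 5008 7
    print(len(plain), len(plain) - len(set(plain)))     # 5001 0
```

With 8 ranks, padding evaluates 5,008 items, 7 of them duplicates. The strided split gives rank 0 626 samples and the other ranks 625 each, so any collective operation in the evaluation loop must tolerate uneven per-rank counts.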

### 4. Run Training

```bash
bash scripts/train_drseg.sh \
  --gpus 0,1,2,3,4,5,6,7 \
  --base-model checkpoints/llava-v1.6-vicuna-7b \
  --data data/DRSeg \
  --clip checkpoints/clip-vit-large-patch14 \
  --exp pixdlm_train
```

## Input and Output

PixDLM takes one UAV image and one reasoning-oriented question as input. The
question identifies a target through spatial, visual-attribute, or scene-context
reasoning, for example "Which vehicle is closest to the intersection and may
affect traffic flow?"

The model outputs a textual answer and a target segmentation mask. Evaluation
also writes visualizations and metrics to:

```text
outputs/<exp_name>/<dataset_name>/with_cot/
logs/<exp_name>/
```

Typical per-sample artifacts include the input image, the predicted mask, the
ground-truth mask, a red/green overlay, and a JSON result file.
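
To inspect a finished run, the JSON result files can be gathered from the output directory. This is a minimal sketch under stated assumptions: the helper `load_results` is hypothetical, the dataset name `DRSeg` in the usage line is a guess at what `<dataset_name>` resolves to, and nothing is assumed about the JSON schema.

```python
import json
from pathlib import Path


def load_results(exp_name, dataset_name, root="outputs"):
    """Load every JSON file under <root>/<exp_name>/<dataset_name>/with_cot/.

    Returns a dict mapping each file's path (relative to the run directory)
    to its parsed content. Returns an empty dict if the directory is absent.
    """
    run_dir = Path(root) / exp_name / dataset_name / "with_cot"
    results = {}
    if not run_dir.is_dir():
        return results
    for path in sorted(run_dir.rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            results[str(path.relative_to(run_dir))] = json.load(f)
    return results


if __name__ == "__main__":
    # "DRSeg" is an assumed value for <dataset_name>; adjust to your run.
    results = load_results("pixdlm_drseg_test", "DRSeg")
    print(f"Loaded {len(results)} JSON result files")
```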

## Dataset

DRSeg is available at:

- https://huggingface.co/datasets/WhynotHug/DRSeg

It contains 10,000 UAV images with instance masks and reasoning annotations:

- Train: 2,999 samples
- Validation: 2,000 samples
- Test: 5,001 samples

Reasoning types are balanced across spatial, attribute, and scene-level
reasoning.

## Citation

```bibtex
@inproceedings{ke2026pixdlm,
  title={PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation},
  author={Ke, Shuyan and Mei, Yifan and Wu, Changli and Zheng, Yonghan and Ji, Jiayi and Cao, Liujuan and Ji, Rongrong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```

## Acknowledgements

This project builds on LLaVA, CLIP, SAM/SAM2, and the broader reasoning
segmentation ecosystem. Please follow the licenses of all upstream components.