【CVPR2026 Highlight】PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
Highlights
- UAV reasoning segmentation: instruction following, visual reasoning, and pixel-level segmentation in high-resolution aerial scenes.
- DRSeg benchmark: 10,000 UAV images with instance masks and reasoning QA pairs.
- Dual-path visual design: a language-aligned semantic path and a pixel-level perception path for small-object UAV scenes.
- Reproducible scripts for setup, evaluation, and training.
Repository Layout
PixDLM/
├── model/ # PixDLM, LLaVA, and segmentation modules
├── utils/ # DRSeg and referring-segmentation data utilities
├── configs/ # CLIP preprocessing configs
├── scripts/ # Setup, evaluation, training, release helpers
├── docs/ # Reproducibility and data/model documentation
├── examples/ # Minimal local examples and expected file layout
├── pretrained/pixdlm-7b/ # Lightweight model config/tokenizer files
├── release/huggingface/ # Model card, dataset card, Space, upload helpers
├── eval.py # Evaluation entry point
├── train_ds.py # Training entry point
└── requirements.txt
Large files are intentionally not committed to this source tree. Download them with the commands below.
Quick Start
1. Create Environment
git clone https://huggingface.co/WhynotHug/PixDLM
cd PixDLM
conda create -n pixdlm python=3.10 -y
conda activate pixdlm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
2. Download Model and Data
python scripts/download_assets.py \
--model-repo WhynotHug/PixDLM \
--dataset-repo WhynotHug/DRSeg \
--output-dir .
Expected local layout:
PixDLM/
├── pretrained/
│   └── pixdlm-7b/                 # PixDLM HF model snapshot
├── checkpoints/
│   ├── clip-vit-large-patch14/    # CLIP vision tower
│   └── sam2_checkpoints/          # Optional SAM/SAM2 checkpoints
└── data/
    └── DRSeg/
        ├── DRtrain/
        ├── DRval/
        ├── DRtest/
        └── label/
If your DRSeg archive uses label/ instead of labels/, run:
python scripts/prepare_drseg.py --data-root data/DRSeg
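Before launching evaluation, it can help to confirm that the downloaded assets match the expected layout. The following is a minimal sketch, not part of the repository: the helper names `missing_dirs` and `has_label_dir` are hypothetical, the directory list is copied from the layout above, and the check accepts either `label/` or `labels/` because both names are mentioned in this section.

```python
from pathlib import Path

# Directories taken from the "Expected local layout" tree.
# sam2_checkpoints/ is marked optional there, so it is not required here.
REQUIRED_DIRS = [
    "pretrained/pixdlm-7b",
    "checkpoints/clip-vit-large-patch14",
    "data/DRSeg/DRtrain",
    "data/DRSeg/DRval",
    "data/DRSeg/DRtest",
]


def missing_dirs(root, required=REQUIRED_DIRS):
    """Return the required sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).is_dir()]


def has_label_dir(root):
    """Return True if data/DRSeg contains a label/ or labels/ directory.

    Assumption: either name may be present, depending on whether
    scripts/prepare_drseg.py has been run.
    """
    drseg = Path(root) / "data" / "DRSeg"
    return (drseg / "label").is_dir() or (drseg / "labels").is_dir()


def report(root="."):
    """Print a short human-readable summary of the layout check."""
    for rel in missing_dirs(root):
        print(f"missing: {rel}")
    if not has_label_dir(root):
        print("missing: data/DRSeg/label (or labels)")
```

Calling `report(".")` from the repository root prints one line per missing directory and nothing when the layout is complete.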
3. Run Evaluation
Single GPU:
bash scripts/eval_drseg.sh \
--gpus 0 \
--model pretrained/pixdlm-7b \
--data data/DRSeg \
--clip checkpoints/clip-vit-large-patch14
Multi-GPU:
bash scripts/eval_drseg.sh \
--gpus 0,1,2,3,4,5,6,7 \
--model pretrained/pixdlm-7b \
--data data/DRSeg \
--clip checkpoints/clip-vit-large-patch14 \
--exp pixdlm_drseg_test
Important: PyTorch DistributedSampler(drop_last=False) pads the test set by
repeating samples when the sample count is not divisible by the number of ranks,
so the repeated samples are evaluated twice. For exact paper-table reproduction
on DRSeg test, run on a single GPU or use a no-padding sampler.
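To see how much padding a given GPU count introduces, the arithmetic can be sketched in a few lines. This is an illustration rather than code from the repository: the helper name `distributed_padding` is hypothetical, and the sample count of 5,001 is the DRSeg test split size listed in the Dataset section. With `drop_last=False`, each rank receives `ceil(N / world_size)` indices, and the shortfall is filled with repeated samples.

```python
import math


def distributed_padding(num_samples: int, world_size: int) -> int:
    """Number of repeated samples added when drop_last=False.

    Each rank gets ceil(num_samples / world_size) indices, so the total
    is rounded up to a multiple of world_size.
    """
    per_rank = math.ceil(num_samples / world_size)
    return per_rank * world_size - num_samples


# DRSeg test split size, taken from the Dataset section.
TEST_SAMPLES = 5001

for world_size in (1, 2, 4, 8):
    extra = distributed_padding(TEST_SAMPLES, world_size)
    print(f"{world_size} GPU(s): {extra} repeated sample(s)")
# 1 GPU(s): 0 repeated sample(s)
# 2 GPU(s): 1 repeated sample(s)
# 4 GPU(s): 3 repeated sample(s)
# 8 GPU(s): 7 repeated sample(s)
```

Because 5,001 is odd, any multi-GPU run with an even number of ranks pads the test set, which is why a single-GPU run gives the exact sample count.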
4. Run Training
bash scripts/train_drseg.sh \
--gpus 0,1,2,3,4,5,6,7 \
--base-model checkpoints/llava-v1.6-vicuna-7b \
--data data/DRSeg \
--clip checkpoints/clip-vit-large-patch14 \
--exp pixdlm_train
Input and Output
PixDLM takes one UAV image and one reasoning-oriented question as input. The question identifies a target through spatial, visual-attribute, or scene-context reasoning, for example "Which vehicle is closest to the intersection and may affect traffic flow?"
The model outputs a textual answer and a target segmentation mask. Evaluation also writes visualizations and metrics to:
outputs/<exp_name>/<dataset_name>/with_cot/
logs/<exp_name>/
Typical per-sample artifacts include the input image, predicted mask, ground truth mask, red/green overlay, and a JSON result file.
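A quick way to sanity-check a predicted mask against its ground truth mask is intersection over union (IoU). The sketch below is an assumption-laden illustration: this document does not state which metrics the evaluation script reports or how masks are stored, so the helper `mask_iou` is hypothetical and works on plain nested lists of 0/1 values. Returning 1.0 when both masks are empty is also a convention chosen here, not one taken from the project.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks of equal shape.

    pred and target are nested lists (rows of 0/1 values).
    Convention assumed here: two empty masks count as a perfect match.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    return intersection / union if union else 1.0


# Toy 2x3 masks: two pixels overlap, four pixels are set in at least one mask.
predicted = [[0, 1, 1],
             [0, 1, 0]]
ground_truth = [[0, 1, 0],
                [0, 1, 1]]

print(mask_iou(predicted, ground_truth))  # 0.5
```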
Dataset
DRSeg is available in the WhynotHug/DRSeg dataset repository on Hugging Face.
It contains 10,000 UAV images with instance masks and reasoning annotations:
- Train: 2,999 samples
- Validation: 2,000 samples
- Test: 5,001 samples
Reasoning types are balanced across spatial, attribute, and scene-level reasoning.
Citation
@inproceedings{ke2026pixdlm,
title={PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation},
author={Ke, Shuyan and Mei, Yifan and Wu, Changli and Zheng, Yonghan and Ji, Jiayi and Cao, Liujuan and Ji, Rongrong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2026}
}
Acknowledgements
This project builds on LLaVA, CLIP, SAM/SAM2, and the broader reasoning segmentation ecosystem. Please follow the licenses of all upstream components.