license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: image-segmentation
datasets:
- WhynotHug/DRSeg
tags:
- pixdlm
- cvpr-2026
- compute-transparency
- reasoning-segmentation
- uav
- remote-sensing
- vision-language
- image-segmentation
[CVPR 2026 Highlight] PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
Highlights
- UAV reasoning segmentation: instruction following, visual reasoning, and pixel-level segmentation in high-resolution aerial scenes.
- DRSeg benchmark: 10,000 UAV images with instance masks and reasoning QA pairs.
- Dual-path visual design: a language-aligned semantic path and a pixel-level perception path for small-object UAV scenes.
- Reproducible scripts for setup, evaluation, and training.
Repository Layout
PixDLM/
├── model/                    # PixDLM, LLaVA, and segmentation modules
├── utils/                    # DRSeg and referring-segmentation data utilities
├── configs/                  # CLIP preprocessing configs
├── scripts/                  # Setup, evaluation, training, release helpers
├── docs/                     # Reproducibility and data/model documentation
├── examples/                 # Minimal local examples and expected file layout
├── pretrained/pixdlm-7b/     # Lightweight model config/tokenizer files
├── release/huggingface/      # Model card, dataset card, Space, upload helpers
├── eval.py                   # Evaluation entry point
├── train_ds.py               # Training entry point
└── requirements.txt
Large files are intentionally not committed to this source tree. Download them with the commands below.
Quick Start
1. Create Environment
git clone https://huggingface.co/WhynotHug/PixDLM
cd PixDLM
conda create -n pixdlm python=3.10 -y
conda activate pixdlm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
2. Download Model and Data
python scripts/download_assets.py \
--model-repo WhynotHug/PixDLM \
--dataset-repo WhynotHug/DRSeg \
--output-dir .
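If you prefer to fetch the two Hub repositories by hand, the sketch below shows one way to do it with `huggingface_hub.snapshot_download`. It is not the contents of `scripts/download_assets.py`: the function names `plan_targets` and `download_assets` are hypothetical, the target directories are assumed from the expected layout in this section, and the CLIP and SAM/SAM2 checkpoints are not covered.

```python
from pathlib import Path


def plan_targets(output_dir="."):
    """Map each asset to the local directory assumed from the expected layout."""
    root = Path(output_dir)
    return {
        "model": root / "pretrained" / "pixdlm-7b",
        "dataset": root / "data" / "DRSeg",
    }


def download_assets(output_dir=".",
                    model_repo="WhynotHug/PixDLM",
                    dataset_repo="WhynotHug/DRSeg"):
    """Hypothetical stand-in for scripts/download_assets.py (assumption).

    The real script may download additional files or use different paths.
    """
    # Imported lazily so the planning helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    targets = plan_targets(output_dir)
    # Model repositories are the default repo type on the Hub.
    snapshot_download(repo_id=model_repo, local_dir=targets["model"])
    # Dataset repositories need repo_type="dataset".
    snapshot_download(repo_id=dataset_repo, repo_type="dataset",
                      local_dir=targets["dataset"])
    return targets


if __name__ == "__main__":
    # Only prints the plan; call download_assets(".") to fetch the files.
    for name, path in plan_targets(".").items():
        print(f"{name}: {path}")
```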
Expected local layout:
PixDLM/
├── pretrained/
│   └── pixdlm-7b/                  # PixDLM HF model snapshot
├── checkpoints/
│   ├── clip-vit-large-patch14/     # CLIP vision tower
│   └── sam2_checkpoints/           # Optional SAM/SAM2 checkpoints
└── data/
    └── DRSeg/
        ├── DRtrain/
        ├── DRval/
        ├── DRtest/
        └── label/
If your DRSeg archive uses label/ instead of labels/, run:
python scripts/prepare_drseg.py --data-root data/DRSeg
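Before moving on to evaluation, it can help to confirm that the directories from the expected layout are in place. The following is a minimal sketch, not part of the repository: `missing_paths` is a hypothetical helper, the optional SAM/SAM2 directory is skipped, and accepting either `label/` or `labels/` is an assumption made for illustration.

```python
from pathlib import Path

# Directories taken from the expected layout, relative to the repo root.
REQUIRED_DIRS = [
    "pretrained/pixdlm-7b",
    "checkpoints/clip-vit-large-patch14",
    "data/DRSeg/DRtrain",
    "data/DRSeg/DRval",
    "data/DRSeg/DRtest",
]
# Either spelling of the annotation folder is accepted here (assumption).
LABEL_DIR_CANDIDATES = ["data/DRSeg/labels", "data/DRSeg/label"]


def missing_paths(root="."):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    if not any((root / d).is_dir() for d in LABEL_DIR_CANDIDATES):
        missing.append("data/DRSeg/labels (or label)")
    return missing


if __name__ == "__main__":
    problems = missing_paths(".")
    if problems:
        print("Missing directories:")
        for p in problems:
            print("  " + p)
    else:
        print("Layout looks complete.")
```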
3. Run Evaluation
Single GPU:
bash scripts/eval_drseg.sh \
--gpus 0 \
--model pretrained/pixdlm-7b \
--data data/DRSeg \
--clip checkpoints/clip-vit-large-patch14
Multi-GPU:
bash scripts/eval_drseg.sh \
--gpus 0,1,2,3,4,5,6,7 \
--model pretrained/pixdlm-7b \
--data data/DRSeg \
--clip checkpoints/clip-vit-large-patch14 \
--exp pixdlm_drseg_test
Important: PyTorch DistributedSampler(drop_last=False) pads the test set when
the sample count is not divisible by the number of ranks. For exact paper-table
reproduction on DRSeg test, run single GPU or use a no-padding sampler.
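To see how much padding this causes, the arithmetic can be reproduced without PyTorch. The sketch below mirrors the `drop_last=False` sizing rule (each rank gets the ceiling of samples divided by ranks) and applies it to the 5,001-sample test split listed in the Dataset section. The helper name `padded_eval_size` is hypothetical.

```python
import math


def padded_eval_size(num_samples, world_size):
    """Return (per_rank, total, duplicates) under drop_last=False sharding."""
    # Every rank receives the same number of samples, rounded up.
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    # The difference is filled by repeating samples from the dataset.
    return per_rank, total, total - num_samples


if __name__ == "__main__":
    for world_size in (1, 4, 8):
        per_rank, total, dup = padded_eval_size(5001, world_size)
        print(f"{world_size} rank(s): {per_rank} per rank, "
              f"{total} evaluated, {dup} duplicated")
```

With 8 ranks, 5,008 samples are evaluated, so 7 samples are counted twice. This is why multi-GPU metrics can differ slightly from the single-GPU numbers.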
4. Run Training
bash scripts/train_drseg.sh \
--gpus 0,1,2,3,4,5,6,7 \
--base-model checkpoints/llava-v1.6-vicuna-7b \
--data data/DRSeg \
--clip checkpoints/clip-vit-large-patch14 \
--exp pixdlm_train
Input and Output
PixDLM takes one UAV image and one reasoning-oriented question as input. The question identifies a target through spatial, visual-attribute, or scene-context reasoning, for example "Which vehicle is closest to the intersection and may affect traffic flow?"
The model outputs a textual answer and a target segmentation mask. Evaluation also writes visualizations and metrics to:
outputs/<exp_name>/<dataset_name>/with_cot/
logs/<exp_name>/
Typical per-sample artifacts include the input image, predicted mask, ground truth mask, red/green overlay, and a JSON result file.
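The per-sample JSON files can be gathered for further analysis with a few lines of standard-library Python. This is a sketch under assumptions: the file names and JSON schema are not documented here, so the code simply loads every `*.json` file found under the `with_cot/` directory, and `load_results` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_results(exp_name, dataset_name, outputs_root="outputs"):
    """Load every JSON file under outputs/<exp_name>/<dataset_name>/with_cot/.

    Returns a dict keyed by the file path relative to that directory.
    File naming and schema are assumptions; nothing is interpreted here.
    """
    result_dir = Path(outputs_root) / exp_name / dataset_name / "with_cot"
    if not result_dir.is_dir():
        return {}
    results = {}
    for path in sorted(result_dir.rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            results[path.relative_to(result_dir).as_posix()] = json.load(f)
    return results


if __name__ == "__main__":
    # "pixdlm_drseg_test" is the --exp value from the multi-GPU example;
    # "DRSeg" as the dataset directory name is a placeholder.
    results = load_results("pixdlm_drseg_test", "DRSeg")
    print(f"Loaded {len(results)} result file(s)")
```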
Dataset
DRSeg is available on the Hugging Face Hub under the dataset ID WhynotHug/DRSeg.
It contains 10,000 UAV images with instance masks and reasoning annotations:
- Train: 2,999 samples
- Validation: 2,000 samples
- Test: 5,001 samples
Reasoning types are balanced across spatial, attribute, and scene-level reasoning.
Citation
@inproceedings{ke2026pixdlm,
title={PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation},
author={Ke, Shuyan and Mei, Yifan and Wu, Changli and Zheng, Yonghan and Ji, Jiayi and Cao, Liujuan and Ji, Rongrong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2026}
}
Acknowledgements
This project builds on LLaVA, CLIP, SAM/SAM2, and the broader reasoning segmentation ecosystem. Please follow the licenses of all upstream components.