---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: image-segmentation
datasets:
- WhynotHug/DRSeg
tags:
- pixdlm
- cvpr-2026
- compute-transparency
- reasoning-segmentation
- uav
- remote-sensing
- vision-language
- image-segmentation
---

# [CVPR 2026 Highlight] PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

[Paper](https://arxiv.org/abs/2604.15670)
[Model](https://huggingface.co/WhynotHug/PixDLM)
[Dataset](https://huggingface.co/datasets/WhynotHug/DRSeg)
[Demo](https://huggingface.co/spaces/WhynotHug/PixDLM)

## Highlights

- UAV reasoning segmentation: instruction following, visual reasoning, and
  pixel-level segmentation in high-resolution aerial scenes.
- DRSeg benchmark: 10,000 UAV images with instance masks and reasoning QA pairs.
- Dual-path visual design: a language-aligned semantic path and a pixel-level
  perception path for small-object UAV scenes.
- Reproducible scripts for setup, evaluation, and training.

## Repository Layout

```text
PixDLM/
├── model/                  # PixDLM, LLaVA, and segmentation modules
├── utils/                  # DRSeg and referring-segmentation data utilities
├── configs/                # CLIP preprocessing configs
├── scripts/                # Setup, evaluation, training, release helpers
├── docs/                   # Reproducibility and data/model documentation
├── examples/               # Minimal local examples and expected file layout
├── pretrained/pixdlm-7b/   # Lightweight model config/tokenizer files
├── release/huggingface/    # Model card, dataset card, Space, upload helpers
├── eval.py                 # Evaluation entry point
├── train_ds.py             # Training entry point
└── requirements.txt
```

Large files are intentionally not committed to this source tree. Download them
with the commands below.

## Quick Start

### 1. Create Environment

```bash
git clone https://huggingface.co/WhynotHug/PixDLM
cd PixDLM
conda create -n pixdlm python=3.10 -y
conda activate pixdlm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
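
After installation, a quick sanity check can confirm that the interpreter and key packages are visible. The sketch below is not part of the repository. The package list is an assumption based on the commands above: `torch` and `transformers` are assumed to be installed by `requirements.txt`, and `flash_attn` is the import name of the `flash-attn` package. The function name `check_environment` is hypothetical.

```python
import importlib.util
import sys

# Assumed package list (import names, not pip names). Adjust to match requirements.txt.
ASSUMED_PACKAGES = ("torch", "transformers", "flash_attn")


def check_environment(packages=ASSUMED_PACKAGES, expected_python=(3, 10)):
    """Report whether the Python version matches and which packages are importable."""
    report = {"python_ok": sys.version_info[:2] == expected_python}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing or mismatched'}")
```

Run it inside the activated `pixdlm` environment; any entry reported as missing points to the install step that needs repeating.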

### 2. Download Model and Data

```bash
python scripts/download_assets.py \
  --model-repo WhynotHug/PixDLM \
  --dataset-repo WhynotHug/DRSeg \
  --output-dir .
```

Expected local layout:

```text
PixDLM/
├── pretrained/
│   └── pixdlm-7b/                 # PixDLM HF model snapshot
├── checkpoints/
│   ├── clip-vit-large-patch14/    # CLIP vision tower
│   └── sam2_checkpoints/          # Optional SAM/SAM2 checkpoints
└── data/
    └── DRSeg/
        ├── DRtrain/
        ├── DRval/
        ├── DRtest/
        └── label/
```

If your DRSeg archive uses `label/` instead of `labels/`, run:

```bash
python scripts/prepare_drseg.py --data-root data/DRSeg
```
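
Before launching evaluation or training, it can help to confirm that the download produced the layout above. The following is a minimal sketch, not a script shipped with the repository: the helper name `missing_paths` is hypothetical, the optional `sam2_checkpoints/` directory is deliberately not checked, and either `label/` or `labels/` is accepted because the text above mentions both names.

```python
from pathlib import Path

# Required directories from the expected layout, relative to the repo root.
EXPECTED_DIRS = (
    "pretrained/pixdlm-7b",
    "checkpoints/clip-vit-large-patch14",
    "data/DRSeg/DRtrain",
    "data/DRSeg/DRval",
    "data/DRSeg/DRtest",
)
# Accept either annotation folder name (assumption: one of them must exist).
LABEL_DIR_CANDIDATES = ("data/DRSeg/label", "data/DRSeg/labels")


def missing_paths(root="."):
    """Return the expected directories that do not exist under `root`."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    if not any((root / d).is_dir() for d in LABEL_DIR_CANDIDATES):
        missing.append(" or ".join(LABEL_DIR_CANDIDATES))
    return missing


if __name__ == "__main__":
    problems = missing_paths(".")
    print("Layout looks complete." if not problems else f"Missing: {problems}")
```

Run it from the `PixDLM/` root; an empty result means every required directory was found.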

### 3. Run Evaluation

Single GPU:

```bash
bash scripts/eval_drseg.sh \
  --gpus 0 \
  --model pretrained/pixdlm-7b \
  --data data/DRSeg \
  --clip checkpoints/clip-vit-large-patch14
```

Multi-GPU:

```bash
bash scripts/eval_drseg.sh \
  --gpus 0,1,2,3,4,5,6,7 \
  --model pretrained/pixdlm-7b \
  --data data/DRSeg \
  --clip checkpoints/clip-vit-large-patch14 \
  --exp pixdlm_drseg_test
```

Important: with `drop_last=False`, PyTorch's `DistributedSampler` pads the test
set with repeated samples when the sample count is not divisible by the number
of ranks. For exact paper-table reproduction on the DRSeg test split, run on a
single GPU or use a no-padding sampler.
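
To see where the padding comes from, the sketch below re-implements the sampler's index arithmetic in plain Python instead of calling PyTorch. It is an illustration, not the repository's evaluation code: both function names are hypothetical, shuffling is omitted because it does not change the counts, and the sample count is assumed to be at least the number of ranks.

```python
import math


def padded_rank_indices(num_samples, num_ranks, rank):
    """Mimic the index assignment of DistributedSampler with drop_last=False."""
    per_rank = math.ceil(num_samples / num_ranks)
    total = per_rank * num_ranks
    indices = list(range(num_samples))
    # Pad by repeating samples from the start so every rank gets `per_rank` items.
    indices += indices[: total - num_samples]
    return indices[rank:total:num_ranks]


def unpadded_rank_indices(num_samples, num_ranks, rank):
    """No-padding alternative: strided split, so some ranks get one sample fewer."""
    return list(range(rank, num_samples, num_ranks))


if __name__ == "__main__":
    n, ranks = 5001, 8  # DRSeg test size and the 8-GPU command above
    padded = [i for r in range(ranks) for i in padded_rank_indices(n, ranks, r)]
    exact = [i for r in range(ranks) for i in unpadded_rank_indices(n, ranks, r)]
    print(len(padded) - n)  # 7 duplicated samples
    print(len(exact) - n)   # 0 duplicated samples
```

With 5,001 test samples on 8 ranks, each rank is assigned 626 samples, so 7 samples are evaluated twice. A strided split avoids this, but ranks then hold different numbers of results, so they would need to be collected with something that tolerates uneven lengths, for example `torch.distributed.all_gather_object`.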

### 4. Run Training

```bash
bash scripts/train_drseg.sh \
  --gpus 0,1,2,3,4,5,6,7 \
  --base-model checkpoints/llava-v1.6-vicuna-7b \
  --data data/DRSeg \
  --clip checkpoints/clip-vit-large-patch14 \
  --exp pixdlm_train
```

## Input and Output

PixDLM takes one UAV image and one reasoning-oriented question as input. The
question identifies a target through spatial, visual-attribute, or scene-context
reasoning, for example "Which vehicle is closest to the intersection and may
affect traffic flow?"

The model outputs a textual answer and a target segmentation mask. Evaluation
also writes visualizations and metrics to:

```text
outputs/<exp_name>/<dataset_name>/with_cot/
logs/<exp_name>/
```

Typical per-sample artifacts include the input image, predicted mask, ground
truth mask, red/green overlay, and a JSON result file.
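
The per-sample JSON files can be gathered for further analysis. The sketch below is a minimal helper that assumes only the output directory pattern shown above. The function name `load_results` is hypothetical, the `"DRSeg"` dataset name in the demo is a placeholder, and the JSON schema is treated as opaque because it is not documented here.

```python
import json
from pathlib import Path


def load_results(exp_name, dataset_name, root="outputs"):
    """Load every *.json file under <root>/<exp_name>/<dataset_name>/with_cot/."""
    out_dir = Path(root) / exp_name / dataset_name / "with_cot"
    if not out_dir.is_dir():
        return {}
    results = {}
    for path in sorted(out_dir.rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            # Key each result by its path relative to the output directory.
            results[path.relative_to(out_dir).as_posix()] = json.load(f)
    return results


if __name__ == "__main__":
    # Placeholder names: substitute your own --exp value and dataset folder.
    results = load_results("pixdlm_drseg_test", "DRSeg")
    print(f"Loaded {len(results)} result files")
```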

## Dataset

DRSeg is available at:

- https://huggingface.co/datasets/WhynotHug/DRSeg

It contains 10,000 UAV images with instance masks and reasoning annotations:

- Train: 2,999 samples
- Validation: 2,000 samples
- Test: 5,001 samples

Reasoning types are balanced across spatial, attribute, and scene-level
reasoning.
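
As a quick arithmetic check on the split sizes, the sketch below sums them, computes each split's share, and lists which GPU counts up to 8 divide the test split evenly, which is relevant to the sampler-padding note in the evaluation section. The helper names are hypothetical.

```python
# Split sizes as listed above.
SPLITS = {"train": 2999, "validation": 2000, "test": 5001}


def split_shares(splits):
    """Return each split's fraction of the total sample count."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


def even_rank_counts(num_samples, max_ranks=8):
    """Rank counts up to `max_ranks` that divide `num_samples` with no remainder."""
    return [k for k in range(1, max_ranks + 1) if num_samples % k == 0]


if __name__ == "__main__":
    print(sum(SPLITS.values()))              # 10000
    print(split_shares(SPLITS)["test"])      # 0.5001
    print(even_rank_counts(SPLITS["test"]))  # [1, 3]
```

The splits add up to the stated 10,000 images, and among 1 to 8 GPUs only 1 and 3 split the 5,001 test samples without a remainder.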

## Citation

```bibtex
@inproceedings{ke2026pixdlm,
  title={PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation},
  author={Ke, Shuyan and Mei, Yifan and Wu, Changli and Zheng, Yonghan and Ji, Jiayi and Cao, Liujuan and Ji, Rongrong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```

## Acknowledgements

This project builds on LLaVA, CLIP, SAM/SAM2, and the broader reasoning
segmentation ecosystem. Please follow the licenses of all upstream components.