---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: image-segmentation
datasets:
- WhynotHug/DRSeg
tags:
- pixdlm
- cvpr-2026
- compute-transparency
- reasoning-segmentation
- uav
- remote-sensing
- vision-language
- image-segmentation
---

# 【CVPR2026 Highlight】PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

[![Paper](https://img.shields.io/badge/arXiv-2604.15670-b31b1b)](https://arxiv.org/abs/2604.15670)
[![Model](https://img.shields.io/badge/HuggingFace-WhynotHug%2FPixDLM-yellow)](https://huggingface.co/WhynotHug/PixDLM)
[![Dataset](https://img.shields.io/badge/HuggingFace-WhynotHug%2FDRSeg-green)](https://huggingface.co/datasets/WhynotHug/DRSeg)
[![Project Page](https://img.shields.io/badge/Project-PixDLM-blue)](https://huggingface.co/spaces/WhynotHug/PixDLM)

## Highlights

- UAV reasoning segmentation: instruction following, visual reasoning, and pixel-level segmentation in high-resolution aerial scenes.
- DRSeg benchmark: 10,000 UAV images with instance masks and reasoning QA pairs.
- Dual-path visual design: a language-aligned semantic path and a pixel-level perception path for small-object UAV scenes.
- Reproducible scripts for setup, evaluation, and training.

## Repository Layout

```text
PixDLM/
├── model/                  # PixDLM, LLaVA, and segmentation modules
├── utils/                  # DRSeg and referring-segmentation data utilities
├── configs/                # CLIP preprocessing configs
├── scripts/                # Setup, evaluation, training, release helpers
├── docs/                   # Reproducibility and data/model documentation
├── examples/               # Minimal local examples and expected file layout
├── pretrained/pixdlm-7b/   # Lightweight model config/tokenizer files
├── release/huggingface/    # Model card, dataset card, Space, upload helpers
├── eval.py                 # Evaluation entry point
├── train_ds.py             # Training entry point
└── requirements.txt
```

Large files are intentionally not committed to this source tree. Download them with the commands below.

## Quick Start

### 1. Create Environment

```bash
git clone https://huggingface.co/WhynotHug/PixDLM
cd PixDLM
conda create -n pixdlm python=3.10 -y
conda activate pixdlm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

### 2. Download Model and Data

```bash
python scripts/download_assets.py \
  --model-repo WhynotHug/PixDLM \
  --dataset-repo WhynotHug/DRSeg \
  --output-dir .
```

Expected local layout:

```text
PixDLM/
├── pretrained/
│   └── pixdlm-7b/                 # PixDLM HF model snapshot
├── checkpoints/
│   ├── clip-vit-large-patch14/    # CLIP vision tower
│   └── sam2_checkpoints/          # Optional SAM/SAM2 checkpoints
└── data/
    └── DRSeg/
        ├── DRtrain/
        ├── DRval/
        ├── DRtest/
        └── label/
```

If your DRSeg archive uses `label/` instead of `labels/`, run:

```bash
python scripts/prepare_drseg.py --data-root data/DRSeg
```

### 3. Run Evaluation

Single GPU:

```bash
bash scripts/eval_drseg.sh \
  --gpus 0 \
  --model pretrained/pixdlm-7b \
  --data data/DRSeg \
  --clip checkpoints/clip-vit-large-patch14
```

Multi-GPU:

```bash
bash scripts/eval_drseg.sh \
  --gpus 0,1,2,3,4,5,6,7 \
  --model pretrained/pixdlm-7b \
  --data data/DRSeg \
  --clip checkpoints/clip-vit-large-patch14 \
  --exp pixdlm_drseg_test
```

Important: PyTorch `DistributedSampler(drop_last=False)` pads the test set when the sample count is not divisible by the number of ranks. For exact paper-table reproduction on DRSeg test, run single GPU or use a no-padding sampler.

### 4. Run Training

```bash
bash scripts/train_drseg.sh \
  --gpus 0,1,2,3,4,5,6,7 \
  --base-model checkpoints/llava-v1.6-vicuna-7b \
  --data data/DRSeg \
  --clip checkpoints/clip-vit-large-patch14 \
  --exp pixdlm_train
```

## Input and Output

PixDLM takes one UAV image and one reasoning-oriented question as input. The question identifies a target through spatial, visual-attribute, or scene-context reasoning, for example "Which vehicle is closest to the intersection and may affect traffic flow?"

The model outputs a textual answer and a target segmentation mask.
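
The padding caveat from the multi-GPU evaluation note in Run Evaluation can be made concrete with a little arithmetic. The sketch below is plain Python and does not import PyTorch; it mirrors how `DistributedSampler` assigns indices when `shuffle=False` and `drop_last=False`. The `shuffle=False` setting is an assumption about the evaluation loader, and `padded_shard`, `unpadded_shard`, and `duplicated_indices` are hypothetical helper names that are not part of this repository. The numbers use the 5,001-sample DRSeg test split and 8 ranks.

```python
import math
from collections import Counter


def padded_shard(num_samples: int, world_size: int, rank: int) -> list[int]:
    """Indices one rank receives, mirroring DistributedSampler with
    shuffle=False and drop_last=False (assumed evaluation settings)."""
    per_rank = math.ceil(num_samples / world_size)
    total_size = per_rank * world_size
    indices = list(range(num_samples))
    padding = total_size - len(indices)
    # Padding repeats indices from the front of the list until every
    # rank has the same number of samples.
    if padding <= len(indices):
        indices += indices[:padding]
    else:
        indices += (indices * math.ceil(padding / len(indices)))[:padding]
    # Each rank takes every world_size-th index, starting at its own rank.
    return indices[rank:total_size:world_size]


def unpadded_shard(num_samples: int, world_size: int, rank: int) -> list[int]:
    """Hypothetical no-padding alternative: same striding, no duplicates.
    Ranks may receive different numbers of samples."""
    return list(range(num_samples))[rank::world_size]


def duplicated_indices(num_samples: int, world_size: int) -> list[int]:
    """Sample indices that are evaluated more than once across all ranks."""
    counts = Counter(
        i
        for rank in range(world_size)
        for i in padded_shard(num_samples, world_size, rank)
    )
    return sorted(i for i, n in counts.items() if n > 1)


num_test, ranks = 5001, 8
evaluated = sum(len(padded_shard(num_test, ranks, r)) for r in range(ranks))
print("samples evaluated with padding:", evaluated)
print("duplicated indices:", duplicated_indices(num_test, ranks))
print("samples evaluated without padding:",
      sum(len(unpadded_shard(num_test, ranks, r)) for r in range(ranks)))
```

Under these assumptions, 8 ranks evaluate 5,008 samples rather than 5,001, so the first 7 test samples are counted twice in any metric aggregated over all ranks. The unpadded variant avoids the duplicates but gives ranks unequal workloads, so a cross-rank reduction has to sum raw counts rather than average per-rank means.
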
Evaluation also writes visualizations and metrics to:

```text
outputs///with_cot/
logs//
```

Typical per-sample artifacts include the input image, predicted mask, ground truth mask, red/green overlay, and a JSON result file.

## Dataset

DRSeg is available at:

- https://huggingface.co/datasets/WhynotHug/DRSeg

It contains 10,000 UAV images with instance masks and reasoning annotations:

- Train: 2,999 samples
- Validation: 2,000 samples
- Test: 5,001 samples

Reasoning types are balanced across spatial, attribute, and scene-level reasoning.

## Citation

```bibtex
@inproceedings{ke2026pixdlm,
  title={PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation},
  author={Ke, Shuyan and Mei, Yifan and Wu, Changli and Zheng, Yonghan and Ji, Jiayi and Cao, Liujuan and Ji, Rongrong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```

## Acknowledgements

This project builds on LLaVA, CLIP, SAM/SAM2, and the broader reasoning segmentation ecosystem. Please follow the licenses of all upstream components.