Image Segmentation
Transformers
PyTorch
pixdlm
cvpr-2026
compute-transparency
reasoning-segmentation
uav
remote-sensing
vision-language
# Reproduction Guide

This guide documents the recommended reproduction path for PixDLM on DRSeg.

## Setup

```bash
conda create -n pixdlm python=3.10 -y
conda activate pixdlm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

Download assets:

```bash
python scripts/download_assets.py --output-dir .
python scripts/prepare_drseg.py --data-root data/DRSeg
```

## Evaluation

Single-GPU exact split evaluation:

```bash
bash scripts/eval_drseg.sh \
  --gpus 0 \
  --model pretrained/pixdlm-7b \
  --data data/DRSeg \
  --clip checkpoints/clip-vit-large-patch14 \
  --exp pixdlm_drseg_test_single_gpu
```
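
Before launching a long evaluation, it can help to confirm that the paths passed to `scripts/eval_drseg.sh` exist. The following is a minimal sketch, not part of the released scripts: the helper name `missing_assets` is hypothetical, and the paths are simply the ones used in the command above.

```python
from pathlib import Path

# Paths copied from the evaluation command in this guide.
REQUIRED_PATHS = (
    "pretrained/pixdlm-7b",
    "data/DRSeg",
    "checkpoints/clip-vit-large-patch14",
)


def missing_assets(root=".", required=REQUIRED_PATHS):
    """Return the required paths that do not exist under `root` (hypothetical helper)."""
    base = Path(root)
    return [path for path in required if not (base / path).exists()]
```

Calling `missing_assets()` from the repository root returns an empty list when every asset is in place, and otherwise lists the paths that still need to be downloaded or prepared.
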

Multi-GPU faster evaluation:

```bash
bash scripts/eval_drseg.sh \
  --gpus 0,1,2,3,4,5,6,7 \
  --model pretrained/pixdlm-7b \
  --data data/DRSeg \
  --clip checkpoints/clip-vit-large-patch14 \
  --exp pixdlm_drseg_test_8gpu
```

Note: PyTorch's default `DistributedSampler` pads the split by repeating samples
when the split size is not divisible by the number of GPUs, so some samples are
scored more than once. For exact paper-table accounting, prefer the single-GPU
command or patch the sampler to remove padded duplicates.
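
The padding effect described in the note can be illustrated without PyTorch. The sketch below imitates the unshuffled, `drop_last=False` index assignment of a distributed sampler in plain Python and shows one way to drop the repeated samples after gathering. It assumes that each gathered result carries its dataset index, which the released scripts may or may not do; both function names are hypothetical.

```python
import math


def padded_rank_indices(num_samples, world_size, rank):
    """Dataset indices one rank would see under unshuffled padding (drop_last=False)."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    # Pad by repeating indices from the start until the total divides evenly.
    padding = total - num_samples
    if padding > 0:
        repeats = math.ceil(padding / num_samples)
        indices += (indices * repeats)[:padding]
    # Each rank takes every world_size-th index, starting at its own rank.
    return indices[rank:total:world_size]


def deduplicate(results):
    """Keep the first value per dataset index from gathered (index, value) pairs."""
    unique = {}
    for index, value in results:
        unique.setdefault(index, value)
    return [unique[index] for index in sorted(unique)]
```

For a split of 10 samples on 8 GPUs, the ranks together process 16 items, so 6 samples are evaluated twice; `deduplicate` reduces the gathered results back to the 10 unique samples before metrics are aggregated.
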

## Expected Metrics

Paper metrics on DRSeg test:

| Reasoning type | gIoU | cIoU |
| --- | ---: | ---: |
| Attribute | 62.80 | 62.84 |
| Scene | 61.75 | 64.03 |
| Spatial | 62.51 | 62.80 |
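
This guide does not define the two metrics. In reasoning-segmentation work, gIoU usually denotes the mean of per-image IoU values and cIoU the cumulative intersection divided by the cumulative union over the whole split. The sketch below follows that common convention; it is an assumption about how the table was computed, not a copy of the released evaluation code. Scoring empty-union samples as 1.0 is also an assumption, and the function name is hypothetical.

```python
def giou_ciou(per_sample_counts):
    """Compute (gIoU, cIoU) from (intersection_pixels, union_pixels) pairs."""
    counts = list(per_sample_counts)
    if not counts:
        raise ValueError("at least one sample is required")
    # gIoU: average the IoU of each image. An empty union (no predicted and
    # no ground-truth pixels) is scored as 1.0 here, which is an assumption.
    ious = [(inter / union) if union > 0 else 1.0 for inter, union in counts]
    giou = sum(ious) / len(ious)
    # cIoU: sum intersections and unions over all images, then divide once.
    total_inter = sum(inter for inter, _ in counts)
    total_union = sum(union for _, union in counts)
    ciou = total_inter / total_union if total_union > 0 else 1.0
    return giou, ciou
```

Because cIoU pools pixels across images, large objects dominate it, while gIoU weights every image equally. This is why the two columns can differ for the same reasoning type.
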

The released scripts print:

- overall gIoU/cIoU,
- CoT vs no-CoT threshold counts,
- per-reasoning-type gIoU/cIoU,
- image-level visualizations in `outputs/<exp>/`.

For each evaluated sample, the visualization directory stores the input image,
predicted mask, ground-truth mask, overlay, and a JSON result containing the
question, answer, and mask metadata.

## Compute Transparency

The full test evaluation is memory-heavy because PixDLM combines a language
model, CLIP visual features, and segmentation decoding. We recommend reporting:

- GPU type and count,
- precision,
- dependency versions,
- exact split and sampler behavior,
- average seconds per image,
- whether CoT text is included in the conditioning input.
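
One way to keep these items together is to write them into a small JSON-serializable record alongside the metrics. The sketch below uses only the Python standard library. The function names, field names, and default package list are illustrative assumptions, not a format defined by the PixDLM release.

```python
import platform
from importlib import metadata


def package_versions(names):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_report(gpu_type, gpu_count, precision, split, sampler,
                 total_seconds, num_images, cot_in_conditioning,
                 packages=("torch", "transformers", "flash-attn")):
    """Collect the recommended reporting items into one dictionary."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return {
        "gpu_type": gpu_type,
        "gpu_count": gpu_count,
        "precision": precision,
        "split": split,
        "sampler": sampler,
        "seconds_per_image": total_seconds / num_images,
        "cot_in_conditioning": cot_in_conditioning,
        "python": platform.python_version(),
        "dependencies": package_versions(packages),
    }
```

The returned dictionary can be passed to `json.dumps` and stored next to the evaluation outputs, so that reported numbers stay attached to the hardware, sampler, and conditioning settings that produced them.
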

The public release acknowledges the CVPR 2026 Compute Transparency Champion
recognition and keeps this guide explicit about evaluation assumptions.