	# Reproduction Guide

	This guide documents the recommended reproduction path for PixDLM on DRSeg.

	## Setup

	```bash
	conda create -n pixdlm python=3.10 -y
	conda activate pixdlm
	pip install -r requirements.txt
	pip install flash-attn --no-build-isolation
	```

	Download the assets, then prepare the DRSeg data:

	```bash
	python scripts/download_assets.py --output-dir .
	python scripts/prepare_drseg.py --data-root data/DRSeg
	```

	## Evaluation

	Single-GPU evaluation over the exact split:

	```bash
	bash scripts/eval_drseg.sh \
	--gpus 0 \
	--model pretrained/pixdlm-7b \
	--data data/DRSeg \
	--clip checkpoints/clip-vit-large-patch14 \
	--exp pixdlm_drseg_test_single_gpu
	```

	Multi-GPU evaluation (faster, but subject to the sampler padding described in the note below):

	```bash
	bash scripts/eval_drseg.sh \
	--gpus 0,1,2,3,4,5,6,7 \
	--model pretrained/pixdlm-7b \
	--data data/DRSeg \
	--clip checkpoints/clip-vit-large-patch14 \
	--exp pixdlm_drseg_test_8gpu
	```

	Note: PyTorch's default `DistributedSampler` pads the split with repeated samples when
	the split size is not divisible by the number of GPUs, so multi-GPU metrics can count
	some samples twice. For exact paper-table accounting, prefer the single-GPU command
	or patch the sampler to remove the padded duplicates.
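
To make the padding concrete, here is a minimal Python sketch that mirrors the index layout of PyTorch's `DistributedSampler` with `shuffle=False` and `drop_last=False`, then drops duplicated indices when merging per-rank results. The helper names `padded_rank_indices` and `merge_unique` and the 10-sample split are illustrative assumptions, not part of the released scripts.

```python
import math


def padded_rank_indices(num_samples, world_size):
    """Per-rank sample indices, padded the way DistributedSampler pads
    when shuffle=False and drop_last=False."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    base = list(range(num_samples))
    # Pad by wrapping around to the start of the split.
    padding = total - num_samples
    padded = base + [base[i % num_samples] for i in range(padding)]
    # Rank r takes every world_size-th index, starting at r.
    return [padded[rank:total:world_size] for rank in range(world_size)]


def merge_unique(per_rank_results):
    """Merge (sample_index, value) pairs from all ranks, one entry per index."""
    merged = {}
    for rank_results in per_rank_results:
        for index, value in rank_results:
            merged.setdefault(index, value)
    return merged


# Hypothetical split: 10 samples on 8 GPUs gives 16 evaluations, 6 of them duplicates.
shards = padded_rank_indices(10, 8)
evaluated = sum(len(shard) for shard in shards)
results = [[(i, f"result-for-{i}") for i in shard] for shard in shards]
unique = merge_unique(results)
print(evaluated, len(unique))
```

Keying results by sample index before computing metrics is one way to get single-GPU accounting from a multi-GPU run; it assumes each result carries its original dataset index.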

	## Expected Metrics

	Paper metrics on DRSeg test:

	| Reasoning type | gIoU | cIoU |
	| --- | ---: | ---: |
	| Attribute | 62.80 | 62.84 |
	| Scene | 61.75 | 64.03 |
	| Spatial | 62.51 | 62.80 |
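
The table reports gIoU and cIoU without defining them. In reasoning-segmentation work these usually denote the mean of per-image IoUs (gIoU) and the cumulative intersection over the cumulative union across the split (cIoU). The sketch below assumes those definitions and binary masks given as flat 0/1 lists. The function names, and the choice to score an empty prediction against an empty target as 1.0, are illustrative assumptions; the released evaluation code may differ.

```python
def intersection_and_union(pred, target):
    """Pixel counts for two binary masks given as equal-length 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union


def giou_ciou(pairs):
    """Return (gIoU, cIoU) for a list of (pred, target) mask pairs."""
    per_image = []
    total_inter = 0
    total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        # Assumption: two empty masks count as a perfect match.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


# Toy masks, not DRSeg data.
pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # IoU = 1/2
    ([1, 1, 1, 1], [1, 1, 1, 1]),  # IoU = 4/4
]
giou, ciou = giou_ciou(pairs)
print(f"gIoU={giou:.4f} cIoU={ciou:.4f}")
```

Under these definitions cIoU weights images by object area while gIoU weights every image equally, which is why the two columns can differ for the same reasoning type.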

	The released scripts print:

	- overall gIoU/cIoU,
	- CoT vs no-CoT threshold counts,
	- per-reasoning-type gIoU/cIoU.

	They also write image-level visualizations to `outputs/<exp>/`.

	For each evaluated sample, the visualization directory stores the input image,
	predicted mask, ground-truth mask, overlay, and a JSON result containing the
	question, answer, and mask metadata.

	## Compute Transparency

	The full test evaluation is memory-heavy because PixDLM combines a language
	model, CLIP visual features, and segmentation decoding. We recommend reporting:

	- GPU type and count,
	- precision,
	- dependency versions,
	- exact split and sampler behavior,
	- average seconds per image,
	- whether CoT text is included in the conditioning input.
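
One way to keep these fields together is to write them into a single JSON record alongside the metrics. The sketch below is a hypothetical helper, not something the released scripts provide: the function names, field names, and all example values are placeholders chosen for illustration, and only the Python standard library is used.

```python
import json
import platform
from importlib import metadata


def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_report(gpu_type, gpu_count, precision, split, sampler,
                 total_seconds, num_images, cot_in_conditioning,
                 packages=("torch", "flash-attn")):
    """Assemble the recommended fields into one JSON-serializable dict."""
    return {
        "gpu_type": gpu_type,
        "gpu_count": gpu_count,
        "precision": precision,
        "python": platform.python_version(),
        "dependencies": {name: package_version(name) for name in packages},
        "split": split,
        "sampler": sampler,
        "avg_seconds_per_image": total_seconds / num_images,
        "cot_in_conditioning": cot_in_conditioning,
    }


# All values below are placeholders, not measured results.
report = build_report(
    gpu_type="<GPU model>", gpu_count=1, precision="<precision>",
    split="DRSeg test", sampler="single GPU, no padding",
    total_seconds=1200.0, num_images=400, cot_in_conditioning=True,
)
print(json.dumps(report, indent=2))
```

Recording total wall-clock time and image count separately, then deriving the average, makes it easier to compare runs that use different GPU counts.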

	The public release acknowledges the 2027 CVPR Compute Transparency Champion
	recognition and keeps this guide explicit about evaluation assumptions.