	---
	base_model: Qwen/Qwen3-VL-4B-Thinking
	library_name: peft
	pipeline_tag: image-text-to-text
	tags:
	- base_model:adapter:Qwen/Qwen3-VL-4B-Thinking
	- lora
	- peft
	- transformers
	- spatial-reasoning
	- visual-question-answering
	- chain-of-thought
	license: apache-2.0
	datasets:
	- spatialchain/SpatialChain-Benchmark
	language:
	- en
	---

	# Qwen3-VL-4B-Thinking — SpatialChain LoRA Adapter

	A LoRA adapter for Qwen3-VL-4B-Thinking fine-tuned on the [SpatialChain-Benchmark](https://huggingface.co/datasets/spatialchain/SpatialChain-Benchmark) dataset. The model learns to produce scene-graph-grounded chain-of-thought reasoning for binary spatial visual questions, structured as:

	```
	<think>
	[step-by-step spatial reasoning]
	</think>
	<answer>
	yes / no
	</answer>
	```
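
A response in this format can be split into its reasoning and its verdict with a few lines of parsing. The sketch below is illustrative only: `parse_spatialchain_output` is a hypothetical helper, not part of the released code, and it treats the opening `<think>` tag as optional in case the generated text starts after it.

```python
import re

# Hypothetical helper for illustration; not part of the SpatialChain release.
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_spatialchain_output(text):
    """Return (reasoning, answer_text, verdict); verdict is 'yes', 'no', or None."""
    head, sep, tail = text.partition("</think>")
    if sep:
        # Everything before </think> is reasoning; drop the opening tag if present.
        reasoning = head.replace("<think>", "", 1).strip()
    else:
        # No reasoning block found: search the whole text for the answer.
        reasoning, tail = "", text
    match = _ANSWER.search(tail)
    answer_text = match.group(1).strip() if match else ""
    # Use the first word of the answer as the binary verdict.
    first = re.match(r"[A-Za-z]+", answer_text)
    word = first.group(0).lower() if first else ""
    verdict = word if word in ("yes", "no") else None
    return reasoning, answer_text, verdict
```

Returning `None` for the verdict makes malformed generations easy to count separately instead of silently scoring them as wrong.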

	---

	## Model Details

	| Field | Value |
	|-------|-------|
	| Base model | [Qwen/Qwen3-VL-4B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking) |
	| Adapter type | LoRA (PEFT) |
	| Training data | [SpatialChain-Benchmark](https://huggingface.co/datasets/spatialchain/SpatialChain-Benchmark) train split (28,350 examples) |
	| Task | Binary spatial VQA with chain-of-thought |
	| Language | English |
	| License | Apache 2.0 |

	---

	## Quick Start

	```python
	# AutoModelForImageTextToText replaces AutoModelForVision2Seq, which recent
	# Transformers releases deprecate.
	from transformers import AutoProcessor, AutoModelForImageTextToText
	from peft import PeftModel
	from PIL import Image
	import torch

	base = "Qwen/Qwen3-VL-4B-Thinking"
	adapter = "spatialchain/Qwen3-VL-4B-Thinking-SpatialChain"

	# Load the processor and the bf16 base model, then attach the LoRA adapter.
	processor = AutoProcessor.from_pretrained(base, trust_remote_code=True)
	model = AutoModelForImageTextToText.from_pretrained(
	    base, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
	)
	model = PeftModel.from_pretrained(model, adapter)
	model.eval()

	image = Image.open("your_image.jpg").convert("RGB")

	messages = [
	    {
	        "role": "system",
	        "content": [{"type": "text", "text": (
	            "Your task:\n"
	            "1. Analyze the image carefully.\n"
	            "2. Provide concise reasoning grounded in visible evidence from the image.\n"
	            "3. End your response with 'Answer: <one short sentence>'."
	        )}],
	    },
	    {
	        "role": "user",
	        "content": [
	            {"type": "image", "image": image},
	            {"type": "text", "text": "Is there a fence to the left of the person?"},
	        ],
	    },
	]

	# Render the chat template to a prompt string, then tokenize text and image together.
	text = processor.apply_chat_template(
	    messages, tokenize=False, add_generation_prompt=True
	)
	inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)

	with torch.inference_mode():
	    ids = model.generate(
	        **inputs,
	        max_new_tokens=512,
	        do_sample=True,
	        temperature=0.6,
	        top_p=0.95,
	        top_k=20,
	    )

	# Decode only the newly generated tokens, skipping the prompt.
	print(processor.tokenizer.decode(ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
	```

	### With 4-bit quantization (lower VRAM)

	```python
	import torch
	from peft import PeftModel
	from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

	# `base` and `adapter` are the repo IDs defined in the Quick Start snippet.
	bnb = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_quant_type="nf4",
	    bnb_4bit_compute_dtype=torch.bfloat16,
	    bnb_4bit_use_double_quant=True,
	)
	model = AutoModelForImageTextToText.from_pretrained(
	    base, quantization_config=bnb, device_map="auto", trust_remote_code=True
	)
	model = PeftModel.from_pretrained(model, adapter)
	```

	---

	## Training Details

	### Dataset

	[SpatialChain-Benchmark](https://huggingface.co/datasets/spatialchain/SpatialChain-Benchmark) — 28,350 training examples pairing spatially-oriented GQA questions with scene-graph-grounded reasoning chains. Questions cover 11 spatial relation types (`left_of`, `right_of`, `above`, `behind`, `near`, `inside`, …); chains were generated with Claude Haiku 4.5 (extended thinking) and retained only when the generated answer matched the GQA ground truth.
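
The retention rule described above can be sketched as a simple filter. This is an assumption-laden illustration: the field names `generated_answer` and `gqa_answer` are hypothetical, and comparing only the leading yes/no word is one plausible matching rule, since the card does not state how a match was determined.

```python
def leading_verdict(answer):
    """Return the lower-cased first word of an answer without punctuation."""
    words = answer.strip().lower().split()
    return words[0].strip(".,!?:;") if words else ""


def keep_matching_chains(candidates):
    """Keep candidates whose generated verdict equals the ground-truth verdict.

    Each candidate is a dict with hypothetical keys 'generated_answer'
    and 'gqa_answer'.
    """
    kept = []
    for cand in candidates:
        truth = leading_verdict(cand["gqa_answer"])
        if truth and leading_verdict(cand["generated_answer"]) == truth:
            kept.append(cand)
    return kept
```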

	Each training target has the following form:
	```
	<think>
	Looking at the image, let me trace through this step-by-step:
	(1) Locating the knife — I can see a knife on the left side of the plate.
	(2) Finding the bread to the right of the knife — there is a large piece of bread ...
	(3) Examining what is to the right of that bread — gray birds are standing on the plate.
	(4) Looking for kittens — I do not see any kittens anywhere in the image.
	</think>
	<answer>
	No, there is a bird to the right of the bread.
	</answer>
	```

	### Hyperparameters

	| Hyperparameter | Value |
	|----------------|-------|
	| Base model | Qwen3-VL-4B-Thinking |
	| Quantization | 4-bit NF4 (BitsAndBytes) |
	| LoRA rank (r) | 16 |
	| LoRA alpha | 32 |
	| LoRA dropout | 0.05 |
	| RSLoRA | ✓ |
	| Target modules | all-linear |
	| Modules to save | `lm_head`, `embed_tokens` |
	| Epochs | 2 |
	| Per-device batch size | 4 |
	| Gradient accumulation | 3 (effective batch = 12) |
	| Learning rate | 3 × 10⁻⁵ |
	| LR schedule | cosine |
	| Warmup ratio | 0.05 |
	| Max sequence length | 32,768 |
	| Image max size | 640 px |
	| Optimizer | AdamW fused |
	| Hardware | 1 × A100 80 GB |
	| Training framework | HuggingFace Transformers + PEFT |
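
A few quantities follow directly from the table. The sketch below derives them with plain arithmetic; the step counts are approximations that ignore how a particular trainer rounds the last partial batch of each epoch, and rounding the warmup up is an assumption.

```python
import math

# Values copied from the hyperparameter table above.
r, alpha = 16, 32
per_device_batch, grad_accum, num_gpus = 4, 3, 1
train_examples, epochs, warmup_ratio = 28_350, 2, 0.05

# rsLoRA scales the adapter update by alpha / sqrt(r) rather than alpha / r.
rslora_scale = alpha / math.sqrt(r)
standard_scale = alpha / r

effective_batch = per_device_batch * grad_accum * num_gpus
# Approximate optimizer steps over the whole run.
optimizer_steps = train_examples * epochs // effective_batch
# Approximate warmup length, rounded up (an assumption).
warmup_steps = math.ceil(warmup_ratio * optimizer_steps)

print(rslora_scale, standard_scale)   # 8.0 2.0
print(effective_batch)                # 12
print(optimizer_steps, warmup_steps)  # 4725 237
```

With rank 16 the rsLoRA scale is four times the standard LoRA scale, which is worth keeping in mind when comparing this learning rate with runs that do not use rsLoRA.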

	---

	## Evaluation

	### SpatialChain test set (n = 899)

	Evaluation uses two complementary axes. Axis 1 measures VQA accuracy (exact match after normalisation). Axis 2 uses a scene-graph-aware LLM judge scoring reasoning faithfulness and completeness independently of the final answer — see the [evaluation code](https://huggingface.co/datasets/spatialchain/SpatialChain-Benchmark) for the full judge protocol.
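
As an illustration of Axis 1, the sketch below scores predictions by exact match after a simple normalisation and reports yes- and no-accuracy separately. The normalisation rule (lower-casing, stripping punctuation, keeping a leading yes/no token) and the function names are assumptions; the released evaluation code may normalise differently. References are assumed to be `yes` or `no`.

```python
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def normalise(answer):
    """Lower-case, strip punctuation, and reduce to a leading yes/no token."""
    cleaned = answer.lower().translate(_PUNCT).strip()
    tokens = cleaned.split()
    if tokens and tokens[0] in ("yes", "no"):
        return tokens[0]
    return cleaned


def accuracy_report(predictions, references):
    """Overall and per-class exact-match accuracy for yes/no references."""
    counts = {"yes": [0, 0], "no": [0, 0]}  # label -> [correct, total]
    for pred, ref in zip(predictions, references):
        label = normalise(ref)
        counts[label][1] += 1
        counts[label][0] += int(normalise(pred) == label)

    def ratio(correct, total):
        return correct / total if total else 0.0

    all_correct = counts["yes"][0] + counts["no"][0]
    all_total = counts["yes"][1] + counts["no"][1]
    return {
        "accuracy": ratio(all_correct, all_total),
        "yes_accuracy": ratio(*counts["yes"]),
        "no_accuracy": ratio(*counts["no"]),
    }
```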

	| Metric | Base (4B) | This model (4B FT) |
	|--------|-----------|------------------------|
	| VQA Accuracy | 78.44% | 82.23% |
	| Macro F1 | 82.01% | 86.67% |
	| Yes-accuracy | 77.74% | 91.34% |
	| No-accuracy | 79.64% | 66.57% |
	| ROUGE-1 vs. reference chain | 0.403 | 0.657 |
	| Token F1 vs. reference chain | 0.392 | 0.646 |
	| Reasoning faithfulness (judge) | 0.585 | 0.631 |
	| Reasoning completeness (judge) | 0.658 | 0.708 |
	| Pass rate | 77.6% | 80.2% |
	| Shortcut rate ↓ | 26.4% | 19.4% |

	Shortcut rate = fraction of correct answers where the judge scores reasoning faithfulness < 0.5. Lower is better.
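
The definition above translates directly into code. In this minimal sketch the record keys `correct` and `faithfulness` are hypothetical names for the answer-correctness flag and the judge's faithfulness score.

```python
def shortcut_rate(records, threshold=0.5):
    """Fraction of correct answers whose judged faithfulness is below threshold.

    Each record is a dict with hypothetical keys:
      'correct'      -> bool, whether the final answer matched the reference
      'faithfulness' -> float in [0, 1], the judge's faithfulness score
    """
    correct = [rec for rec in records if rec["correct"]]
    if not correct:
        return 0.0
    shortcuts = sum(1 for rec in correct if rec["faithfulness"] < threshold)
    return shortcuts / len(correct)
```

Incorrect answers are excluded from the denominator, so the metric isolates cases where the model reached the right answer without faithful reasoning.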

	### External benchmarks

	SFT on SpatialChain improves in-domain performance but introduces a stylistic specialisation effect on out-of-distribution benchmarks — the model adopts the SpatialChain chain format even when the input distribution differs. Replay-augmented training is recommended to mitigate this.

	| Benchmark | Base | Fine-tuned | Δ |
	|-----------|------|------------|---|
	| SpatialChain test | 78.4% | 82.2% | +3.8 pp |
	| [FlagEval/ERQA](https://huggingface.co/datasets/FlagEval/ERQA) | 45.3% | 38.0% | −7.3 pp |
	| [FlagEval/EmbSpatial-Bench](https://huggingface.co/datasets/FlagEval/EmbSpatial-Bench) | 79.1% | 75.7% | −3.4 pp |
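
One common way to realise the replay-augmented training mentioned above is to mix a share of general-purpose examples into the fine-tuning set. The sketch below is a hypothetical illustration, not the authors' recipe: the function name, the `replay_fraction` default of 0.2, and the choice of replay pool are all assumptions.

```python
import random


def mix_with_replay(task_examples, replay_pool, replay_fraction=0.2, seed=0):
    """Return a shuffled list in which about `replay_fraction` of items are replay data.

    `task_examples` would be the SpatialChain training examples and
    `replay_pool` a set of general instruction-following examples
    (both hypothetical inputs here).
    """
    if not 0 <= replay_fraction < 1:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_task + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(task_examples) * replay_fraction / (1 - replay_fraction))
    n_replay = min(n_replay, len(replay_pool))
    mixed = list(task_examples) + rng.sample(list(replay_pool), n_replay)
    rng.shuffle(mixed)
    return mixed
```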

	---

	## Intended Use

	- Spatial VQA — binary yes/no questions about object positions and relations in images
	- Reasoning audit — producing interpretable spatial chains that can be verified against scene structure
	- Research — studying the relationship between chain-of-thought quality and answer correctness in VLMs

	## Out-of-Scope Use

	- Tasks requiring metric depth or 3D reasoning (scene graphs are symbolic, not metric)
	- Open-ended image captioning or generation
	- Non-English inputs

	## Bias and Limitations

	- Yes-bias: the fine-tuned model shows a much larger gap between yes-accuracy and no-accuracy (+24.8 pp) than the base model (−1.9 pp, i.e. the base model is slightly more accurate on "no"), consistent with the 58% yes-rate in the training data. Evaluation should report Yes-acc and No-acc separately.
	- Stylistic specialisation: the model adopts a fixed reasoning format ("Looking at the image, let me trace through this step-by-step…") on all inputs, which may degrade performance on benchmarks with different prompt styles.
	- GQA domain: training images are sourced from GQA (Visual Genome); performance on non-natural-image domains is unknown.
	- Projective bias: 62.7% of training examples involve `left_of` / `right_of` relations; depth-ordered relations (`close`, `far`) are underrepresented.
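
The yes/no gap can be recomputed from the per-class accuracies in the evaluation table. In this short sketch the gap is defined as yes-accuracy minus no-accuracy, so a negative value means the model is more accurate on "no" questions.

```python
# Per-class accuracies (%) copied from the SpatialChain test-set table.
results = {
    "base": {"yes": 77.74, "no": 79.64},
    "fine-tuned": {"yes": 91.34, "no": 66.57},
}


def yes_no_gap(acc):
    """Yes-accuracy minus no-accuracy, in percentage points."""
    return round(acc["yes"] - acc["no"], 2)


for name, acc in results.items():
    print(f"{name}: {yes_no_gap(acc):+.2f} pp")
# base: -1.90 pp
# fine-tuned: +24.77 pp
```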

	---

	## Citation

	```bibtex
	@article{spatialchain2026,
	  title   = {SpatialChain: A Benchmark for Auditing Spatial Reasoning Faithfulness in VLMs},
	  author  = {Anonymous},
	  journal = {Under review at NeurIPS 2026},
	  year    = {2026}
	}
	```

	---

	## Environmental Impact

	Training ran for approximately 5 hours on a single A100 80 GB GPU (cloud instance). Carbon emissions can be estimated with the [ML Impact Calculator](https://mlco2.github.io/impact#compute).