# Qwen3-VL-4B-Thinking – SpatialChain LoRA Adapter
A LoRA adapter for Qwen3-VL-4B-Thinking fine-tuned on the SpatialChain-Benchmark dataset. The model learns to produce scene-graph-grounded chain-of-thought reasoning for binary spatial visual questions, structured as:
```
<think>
[step-by-step spatial reasoning]
</think>
<answer>
yes / no
</answer>
```
## Model Details
| Field | Value |
|---|---|
| Base model | Qwen/Qwen3-VL-4B-Thinking |
| Adapter type | LoRA (PEFT) |
| Training data | SpatialChain-Benchmark train split (28,350 examples) |
| Task | Binary spatial VQA with chain-of-thought |
| Language | English |
| License | Apache 2.0 |
## Quick Start
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel
from PIL import Image
import torch

base = "Qwen/Qwen3-VL-4B-Thinking"
adapter = "spatialchain/Qwen3-VL-4B-Thinking-SpatialChain"

# Load the base model in bf16 and attach the LoRA adapter.
processor = AutoProcessor.from_pretrained(base, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter)
model.eval()

# Build a chat-formatted prompt with one image and one binary spatial question.
image = Image.open("your_image.jpg").convert("RGB")
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": (
            "Your task:\n"
            "1. Analyze the image carefully.\n"
            "2. Provide concise reasoning grounded in visible evidence from the image.\n"
            "3. End your response with 'Answer: <one short sentence>'."
        )}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Is there a fence to the left of the person?"},
        ],
    },
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)

with torch.inference_mode():
    ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(processor.tokenizer.decode(ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
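Depending on the prompt, the generation follows either the `<think>…</think><answer>…</answer>` format described above or the "Answer: …" convention from the system prompt. A minimal sketch for splitting out the reasoning and the final answer (this helper is illustrative, not part of the released code):

```python
import re

def parse_output(decoded: str) -> tuple[str, str]:
    """Split a generation into (reasoning, final_answer), falling back gracefully."""
    think = re.search(r"<think>(.*?)</think>", decoded, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", decoded, re.DOTALL)
    if answer is None:
        # Fall back to the "Answer: ..." convention used in the system prompt above.
        answer = re.search(r"Answer:\s*(.+)", decoded)
    reasoning = think.group(1).strip() if think else decoded.strip()
    final = answer.group(1).strip() if answer else ""
    return reasoning, final

decoded = processor.tokenizer.decode(
    ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
reasoning, final = parse_output(decoded)
print("Answer:", final)
```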
### With 4-bit quantization (lower VRAM)

```python
from transformers import BitsAndBytesConfig

# Reuses `base`, `adapter`, and `torch` from the Quick Start above.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForVision2Seq.from_pretrained(
    base, quantization_config=bnb, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter)
```
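To get a standalone checkpoint that does not need PEFT at inference time, the adapter can be merged into the full-precision base model (merging into a 4-bit quantized model is not supported). This is standard PEFT usage rather than anything specific to this release; the output directory name is arbitrary:

```python
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import PeftModel
import torch

base = "Qwen/Qwen3-VL-4B-Thinking"
adapter = "spatialchain/Qwen3-VL-4B-Thinking-SpatialChain"

# Load the un-quantized base model, attach the adapter, then fold the LoRA weights in.
base_model = AutoModelForVision2Seq.from_pretrained(
    base, torch_dtype=torch.bfloat16, trust_remote_code=True
)
merged = PeftModel.from_pretrained(base_model, adapter).merge_and_unload()

# Save the merged weights together with the processor for standalone loading.
merged.save_pretrained("qwen3-vl-4b-thinking-spatialchain-merged")
AutoProcessor.from_pretrained(base, trust_remote_code=True).save_pretrained(
    "qwen3-vl-4b-thinking-spatialchain-merged"
)
```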
## Training Details
### Dataset
SpatialChain-Benchmark – 28,350 training examples pairing spatially oriented GQA questions with scene-graph-grounded reasoning chains. Questions cover 11 spatial relation types (`left_of`, `right_of`, `above`, `behind`, `near`, `inside`, …); chains were generated with Claude Haiku 4.5 (extended thinking) and retained only when the generated answer matched the GQA ground truth.
Each training example's target takes the form:
```
<think>
Looking at the image, let me trace through this step-by-step:
(1) Locating the knife – I can see a knife on the left side of the plate.
(2) Finding the bread to the right of the knife – there is a large piece of bread ...
(3) Examining what is to the right of that bread – gray birds are standing on the plate.
(4) Looking for kittens – I do not see any kittens anywhere in the image.
</think>
<answer>
No, there is a bird to the right of the bread.
</answer>
```
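A minimal sketch of how a reasoning chain and final answer can be assembled into this target string (the step/answer inputs are assumptions about the dataset schema, not its documented fields):

```python
def build_target(steps: list[str], answer: str) -> str:
    """Wrap numbered reasoning steps and a final answer in the <think>/<answer> target format."""
    numbered = "\n".join(f"({i}) {step}" for i, step in enumerate(steps, start=1))
    return (
        "<think>\n"
        "Looking at the image, let me trace through this step-by-step:\n"
        f"{numbered}\n"
        "</think>\n"
        "<answer>\n"
        f"{answer}\n"
        "</answer>"
    )
```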
### Hyperparameters
| Hyperparameter | Value |
|---|---|
| Base model | Qwen3-VL-4B-Thinking |
| Quantization | 4-bit NF4 (BitsAndBytes) |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| RSLoRA | ✓ |
| Target modules | all-linear |
| Modules to save | lm_head, embed_tokens |
| Epochs | 2 |
| Per-device batch size | 4 |
| Gradient accumulation | 3 (effective batch = 12) |
| Learning rate | 3 × 10⁻⁵ |
| LR schedule | cosine |
| Warmup ratio | 0.05 |
| Max sequence length | 32,768 |
| Image max size | 640 px |
| Optimizer | AdamW fused |
| Hardware | 1 × A100 80 GB |
| Training framework | HuggingFace Transformers + PEFT |
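For reference, the LoRA rows in the table correspond approximately to the following PEFT configuration (reconstructed from the listed values, not copied from the training script):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    use_rslora=True,                      # per the RSLoRA row above
    target_modules="all-linear",          # wrap every linear layer
    modules_to_save=["lm_head", "embed_tokens"],
    task_type="CAUSAL_LM",
)
```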
## Evaluation
### SpatialChain test set (n = 899)
Evaluation uses two complementary axes. Axis 1 measures VQA accuracy (exact match after normalisation). Axis 2 uses a scene-graph-aware LLM judge that scores reasoning faithfulness and completeness independently of the final answer; see the evaluation code for the full judge protocol.
| Metric | Base (4B) | This model (4B FT) |
|---|---|---|
| VQA Accuracy | 78.44% | 82.23% |
| Macro F1 | 82.01% | 86.67% |
| Yes-accuracy | 77.74% | 91.34% |
| No-accuracy | 79.64% | 66.57% |
| ROUGE-1 vs. reference chain | 0.403 | 0.657 |
| Token F1 vs. reference chain | 0.392 | 0.646 |
| Reasoning faithfulness (judge) | 0.585 | 0.631 |
| Reasoning completeness (judge) | 0.658 | 0.708 |
| Pass rate | 77.6% | 80.2% |
| Shortcut rate ↓ | 26.4% | 19.4% |
Shortcut rate = fraction of correct answers where the judge scores reasoning faithfulness < 0.5. Lower is better.
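A minimal sketch of how VQA accuracy and the shortcut rate can be computed from per-example records (the record fields and normalisation here are illustrative assumptions; the released evaluation code defines the authoritative protocol):

```python
def normalize(answer: str) -> str:
    """Reduce a free-form answer to its leading yes/no token, lowercased."""
    tokens = answer.strip().lower().split()
    return tokens[0].rstrip(".,") if tokens else ""

def score(records: list[dict]) -> dict:
    """records: [{'prediction': str, 'reference': str, 'faithfulness': float}, ...]"""
    correct = [r for r in records if normalize(r["prediction"]) == normalize(r["reference"])]
    shortcuts = [r for r in correct if r["faithfulness"] < 0.5]
    return {
        "vqa_accuracy": len(correct) / len(records),
        "shortcut_rate": len(shortcuts) / len(correct) if correct else 0.0,
    }
```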
### External benchmarks
SFT on SpatialChain improves in-domain performance but introduces a stylistic specialisation effect on out-of-distribution benchmarks: the model adopts the SpatialChain chain format even when the input distribution differs. Replay-augmented training is recommended to mitigate this (a sketch follows the table below).
| Benchmark | Base | Fine-tuned | Δ |
|---|---|---|---|
| SpatialChain test | 78.4% | 82.2% | +3.8 pp |
| FlagEval/ERQA | 45.3% | 38.0% | −7.3 pp |
| FlagEval/EmbSpatial-Bench | 79.1% | 75.7% | −3.4 pp |
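One way to implement the replay recommendation is to interleave SpatialChain examples with a general instruction/VQA mix during SFT. The sketch below uses `datasets.interleave_datasets`; both hub IDs and the 80/20 mixing ratio are illustrative placeholders, not values used for this release:

```python
from datasets import load_dataset, interleave_datasets

# Placeholder hub IDs: substitute the actual SpatialChain dataset and your own replay set.
spatialchain = load_dataset("spatialchain/SpatialChain-Benchmark", split="train")
replay = load_dataset("your-org/general-vqa-sft", split="train")

# ~80% SpatialChain / ~20% replay; both datasets must share the same column schema.
mixed = interleave_datasets(
    [spatialchain, replay],
    probabilities=[0.8, 0.2],
    seed=42,
    stopping_strategy="first_exhausted",
)
```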
## Intended Use
- Spatial VQA – binary yes/no questions about object positions and relations in images
- Reasoning audit – producing interpretable spatial chains that can be verified against scene structure
- Research – studying the relationship between chain-of-thought quality and answer correctness in VLMs
## Out-of-Scope Use
- Tasks requiring metric depth or 3D reasoning (scene graphs are symbolic, not metric)
- Open-ended image captioning or generation
- Non-English inputs
## Bias and Limitations
- Yes-bias – the fine-tuned model exhibits a larger yes/no accuracy gap (+24.8 pp) than the base model (+1.9 pp), consistent with the 58% yes-rate in the training data. Evaluation should report Yes-accuracy and No-accuracy separately (a minimal sketch follows this list).
- Stylistic specialisation – the model adopts a fixed reasoning format ("Looking at the image, let me trace through this step-by-step…") on all inputs, which may degrade performance on benchmarks with different prompt styles.
- GQA domain – training images are sourced from GQA (Visual Genome); performance on non-natural-image domains is unknown.
- Projective bias – 62.7% of training examples involve `left_of`/`right_of` relations; depth-ordered relations (`close`, `far`) are underrepresented.
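A minimal sketch of the per-class breakdown recommended above (record fields are illustrative):

```python
def class_conditional_accuracy(records: list[dict]) -> dict:
    """records: [{'prediction': 'yes' or 'no', 'reference': 'yes' or 'no'}, ...]"""
    scores = {}
    for label in ("yes", "no"):
        subset = [r for r in records if r["reference"] == label]
        hits = sum(r["prediction"] == label for r in subset)
        scores[f"{label}_accuracy"] = hits / len(subset) if subset else float("nan")
    return scores
```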
## Citation

```bibtex
@article{spatialchain2026,
  title   = {SpatialChain: A Benchmark for Auditing Spatial Reasoning Faithfulness in VLMs},
  author  = {Anonymous},
  journal = {Under review at NeurIPS 2026},
  year    = {2026}
}
```
## Environmental Impact
Training ran for approximately 5 hours on a single A100 80 GB GPU (cloud instance). Carbon emissions can be estimated with the ML Impact Calculator.