spatialchain committed 22 days ago

Commit 1bbc95c (verified) · 1 Parent(s): a2a4a1f

Update README.md

Files changed (1)

README.md +199 -165

@@ -1,207 +1,241 @@
 ---
 base_model: Qwen/Qwen3-VL-4B-Thinking
 library_name: peft
-pipeline_tag: text-generation
 tags:
-- base_model:adapter:Qwen/Qwen3-VL-4B-Thinking
-- lora
-- transformers
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
 ## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
 ## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
 ## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]
-### Framework versions
-- PEFT 0.19.1

 ---
 base_model: Qwen/Qwen3-VL-4B-Thinking
 library_name: peft
+pipeline_tag: image-text-to-text
 tags:
+  - base_model:adapter:Qwen/Qwen3-VL-4B-Thinking
+  - lora
+  - peft
+  - transformers
+  - spatial-reasoning
+  - visual-question-answering
+  - chain-of-thought
+license: apache-2.0
+datasets:
+  - spatialchain/SpatialChain-Benchmark
+language:
+  - en
 ---
+# Qwen3-VL-4B-Thinking — SpatialChain LoRA Adapter
+A LoRA adapter for **Qwen3-VL-4B-Thinking** fine-tuned on the [SpatialChain-Benchmark](https://huggingface.co/datasets/spatialchain/SpatialChain-Benchmark) dataset. The model learns to produce **scene-graph-grounded chain-of-thought reasoning** for binary spatial visual questions, structured as:
+```
+<think>
+[step-by-step spatial reasoning]
+</think>
+<answer>
+yes / no
+</answer>
+```
+> An 8B variant is also available: [spatialchain/Qwen3-VL-8B-Thinking-SpatialChain](https://huggingface.co/spatialchain/Qwen3-VL-8B-Thinking-SpatialChain)
+---
+## Model Details
+| Field | Value |
+|-------|-------|
+| **Base model** | [Qwen/Qwen3-VL-4B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking) |
+| **Adapter type** | LoRA (PEFT) |
+| **Training data** | [SpatialChain-Benchmark](https://huggingface.co/datasets/spatialchain/SpatialChain-Benchmark) train split (28,350 examples) |
+| **Task** | Binary spatial VQA with chain-of-thought |
+| **Language** | English |
+| **License** | Apache 2.0 |
+---
+## Quick Start
+```python
+from transformers import AutoProcessor, AutoModelForVision2Seq
+from peft import PeftModel
+from PIL import Image
+import torch
+base   = "Qwen/Qwen3-VL-4B-Thinking"
+adapter = "spatialchain/Qwen3-VL-4B-Thinking-SpatialChain"
+processor = AutoProcessor.from_pretrained(base, trust_remote_code=True)
+model     = AutoModelForVision2Seq.from_pretrained(
+    base, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
+)
+model = PeftModel.from_pretrained(model, adapter)
+model.eval()
+image = Image.open("your_image.jpg").convert("RGB")
+messages = [
+    {
+        "role": "system",
+        "content": [{"type": "text", "text": (
+            "Your task:\n"
+            "1. Analyze the image carefully.\n"
+            "2. Provide concise reasoning grounded in visible evidence from the image.\n"
+            "3. End your response with 'Answer: <one short sentence>'."
+        )}],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": image},
+            {"type": "text",  "text": "Is there a fence to the left of the person?"},
+        ],
+    },
+]
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)
+with torch.inference_mode():
+    ids = model.generate(
+        **inputs,
+        max_new_tokens=512,
+        do_sample=True,
+        temperature=0.6,
+        top_p=0.95,
+        top_k=20,
+    )
+print(processor.tokenizer.decode(ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+```
+### With 4-bit quantization (lower VRAM)
+```python
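+# Standalone version: repeats the imports and identifiers from Quick Start.
+import torch
+from peft import PeftModel
+from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
+base = "Qwen/Qwen3-VL-4B-Thinking"
+adapter = "spatialchain/Qwen3-VL-4B-Thinking-SpatialChain"
+# 4-bit NF4 weights with double quantization; compute runs in bfloat16.
+bnb = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16,
+    bnb_4bit_use_double_quant=True,
+)
+# Load the quantized base model, then attach the LoRA adapter on top.
+model = AutoModelForVision2Seq.from_pretrained(
+    base, quantization_config=bnb, device_map="auto", trust_remote_code=True
+)
+model = PeftModel.from_pretrained(model, adapter)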
+```
+---
 ## Training Details
+### Dataset
+[SpatialChain-Benchmark](https://huggingface.co/datasets/spatialchain/SpatialChain-Benchmark) — 28,350 training examples pairing spatially-oriented GQA questions with scene-graph-grounded reasoning chains. Questions cover 11 spatial relation types (`left_of`, `right_of`, `above`, `behind`, `near`, `inside`, …); chains were generated with Claude Haiku 4.5 (extended thinking) and retained only when the generated answer matched the GQA ground truth.
+Target format for each training example:
+```
+<think>
+Looking at the image, let me trace through this step-by-step:
+(1) Locating the knife — I can see a knife on the left side of the plate.
+(2) Finding the bread to the right of the knife — there is a large piece of bread ...
+(3) Examining what is to the right of that bread — gray birds are standing on the plate.
+(4) Looking for kittens — I do not see any kittens anywhere in the image.
+</think>
+<answer>
+No, there is a bird to the right of the bread.
+</answer>
+```
+### Hyperparameters
+| Hyperparameter | Value |
+|----------------|-------|
+| Base model | Qwen3-VL-4B-Thinking |
+| Quantization | 4-bit NF4 (BitsAndBytes) |
+| LoRA rank (r) | 16 |
+| LoRA alpha | 32 |
+| LoRA dropout | 0.05 |
+| RSLoRA | ✓ |
+| Target modules | all-linear |
+| Modules to save | `lm_head`, `embed_tokens` |
+| Epochs | 2 |
+| Per-device batch size | 4 |
+| Gradient accumulation | 3 (effective batch = 12) |
+| Learning rate | 3 × 10⁻⁵ |
+| LR schedule | cosine |
+| Warmup ratio | 0.05 |
+| Max sequence length | 32,768 |
+| Image max size | 640 px |
+| Optimizer | AdamW fused |
+| Hardware | 1 × A100 80 GB |
+| Training framework | HuggingFace Transformers + PEFT |
+---
 ## Evaluation
+### SpatialChain test set (n = 899)
+Evaluation uses two complementary axes. **Axis 1** measures VQA accuracy (exact match after normalisation). **Axis 2** uses a scene-graph-aware LLM judge scoring reasoning faithfulness and completeness independently of the final answer — see the [evaluation code](https://huggingface.co/datasets/spatialchain/SpatialChain-Benchmark) for the full judge protocol.
+| Metric | Base (4B) | **This model (4B FT)** |
+|--------|-----------|------------------------|
+| VQA Accuracy | 78.44% | **82.23%** |
+| Macro F1 | 82.01% | **86.67%** |
+| Yes-accuracy | 77.74% | 91.34% |
+| No-accuracy | 79.64% | 66.57% |
+| ROUGE-1 vs. reference chain | 0.403 | **0.657** |
+| Token F1 vs. reference chain | 0.392 | **0.646** |
+| Reasoning faithfulness (judge) | 0.585 | **0.631** |
+| Reasoning completeness (judge) | 0.658 | **0.708** |
+| Pass rate | 77.6% | **80.2%** |
+| Shortcut rate ↓ | 26.4% | **19.4%** |
+**Shortcut rate** = fraction of *correct* answers where the judge scores reasoning faithfulness < 0.5. Lower is better.
+### External benchmarks
+SFT on SpatialChain improves in-domain performance but introduces a **stylistic specialisation effect** on out-of-distribution benchmarks — the model adopts the SpatialChain chain format even when the input distribution differs. Replay-augmented training is recommended to mitigate this.
+| Benchmark | Base | Fine-tuned | Δ |
+|-----------|------|------------|---|
+| SpatialChain test | 78.4% | **82.2%** | +3.8 pp |
+| [FlagEval/ERQA](https://huggingface.co/datasets/FlagEval/ERQA) | 45.3% | 38.0% | −7.3 pp |
+| [FlagEval/EmbSpatial-Bench](https://huggingface.co/datasets/FlagEval/EmbSpatial-Bench) | 79.1% | 75.7% | −3.4 pp |
+---
+## Intended Use
+- **Spatial VQA** — binary yes/no questions about object positions and relations in images
+- **Reasoning audit** — producing interpretable spatial chains that can be verified against scene structure
+- **Research** — studying the relationship between chain-of-thought quality and answer correctness in VLMs
+## Out-of-Scope Use
+- Tasks requiring metric depth or 3D reasoning (scene graphs are symbolic, not metric)
+- Open-ended image captioning or generation
+- Non-English inputs
+## Bias and Limitations
+- **Yes-bias** — the fine-tuned model shows a much larger gap between Yes-accuracy and No-accuracy (+24.8 pp, favouring "yes") than the base model (−1.9 pp, slightly favouring "no"), consistent with the 58% yes-rate in training data. Evaluation should report Yes-acc and No-acc separately.
+- **Stylistic specialisation** — the model adopts a fixed reasoning format ("Looking at the image, let me trace through this step-by-step…") on all inputs, which may degrade performance on benchmarks with different prompt styles.
+- **GQA domain** — training images are sourced from GQA (Visual Genome); performance on non-natural-image domains is unknown.
+- **Projective bias** — 62.7% of training examples involve `left_of` / `right_of` relations; depth-ordered relations (`close`, `far`) are underrepresented.
+---
+## Citation
+```bibtex
+@article{spatialchain2026,
+  title   = {SpatialChain: A Benchmark for Auditing Spatial Reasoning Faithfulness in VLMs},
+  author  = {Anonymous},
+  journal = {Under review at NeurIPS 2026},
+  year    = {2026}
+}
+```
+---
 ## Environmental Impact
+Training ran for approximately **5 hours** on a single **A100 80 GB** GPU (cloud instance). Carbon emissions can be estimated with the [ML Impact Calculator](https://mlco2.github.io/impact#compute).
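
The `<think>` / `<answer>` output format and the per-class metrics described in the new README can be sketched roughly as follows. This is an illustration only, not the authors' evaluation code: the function names (`parse_output`, `normalise_answer`, `score`), the regex-based parsing, and the normalisation rule (taking the leading yes/no word of the answer) are all assumptions, since the README says only that accuracy is "exact match after normalisation". The shortcut rate follows the README's stated definition: the fraction of correct answers whose judged reasoning faithfulness is below 0.5.

```python
import re
from typing import Optional

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_output(text: str) -> tuple[str, str]:
    """Split a generation into (reasoning, answer); missing parts become ''."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else "",
    )


def normalise_answer(answer: str) -> Optional[str]:
    """Map a free-text answer to 'yes' / 'no' via its first word (assumed rule)."""
    match = re.match(r"\W*(yes|no)\b", answer, re.IGNORECASE)
    return match.group(1).lower() if match else None


def score(preds, golds, faithfulness):
    """Compute accuracy, per-class accuracy, yes/no gap, and shortcut rate.

    preds / golds: lists of 'yes' / 'no' (preds may contain None).
    faithfulness: judge scores in [0, 1], one per example (hypothetical input).
    """
    correct = [p == g for p, g in zip(preds, golds)]

    def acc(label):
        # Accuracy restricted to examples whose gold answer is `label`.
        hits = [c for c, g in zip(correct, golds) if g == label]
        return sum(hits) / len(hits) if hits else float("nan")

    n_correct = sum(correct)
    # A "shortcut" is a correct answer backed by low-faithfulness reasoning.
    shortcuts = sum(1 for c, f in zip(correct, faithfulness) if c and f < 0.5)
    return {
        "accuracy": n_correct / len(golds),
        "yes_acc": acc("yes"),
        "no_acc": acc("no"),
        "yes_no_gap": acc("yes") - acc("no"),
        "shortcut_rate": shortcuts / n_correct if n_correct else float("nan"),
    }
```

Applied to the figures in the README's evaluation table, the `yes_no_gap` entry corresponds to 91.34 − 66.57 ≈ +24.8 pp for the fine-tuned model and 77.74 − 79.64 ≈ −1.9 pp for the base model, which is why reporting `yes_acc` and `no_acc` separately is more informative than overall accuracy alone.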