---
base_model: Qwen/Qwen3-VL-4B-Thinking
library_name: peft
pipeline_tag: image-text-to-text
tags:
- base_model:adapter:Qwen/Qwen3-VL-4B-Thinking
- lora
- peft
- transformers
- spatial-reasoning
- visual-question-answering
- chain-of-thought
license: apache-2.0
datasets:
- spatialchain/SpatialChain-Benchmark
language:
- en
---

# Qwen3-VL-4B-Thinking: SpatialChain LoRA Adapter

A LoRA adapter for **Qwen3-VL-4B-Thinking** fine-tuned on the [SpatialChain-Benchmark](https://huggingface.co/datasets/spatialchain/SpatialChain-Benchmark) dataset. The model learns to produce **scene-graph-grounded chain-of-thought reasoning** for binary spatial visual questions, structured as:

```
<think>
[step-by-step spatial reasoning]
</think>
<answer>
yes / no
</answer>
```
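
When generations are consumed programmatically, the two tagged blocks can be split apart with a few string operations. The following is a minimal sketch, not part of the released code: the helper name `parse_spatialchain_output` is hypothetical, and the handling of a missing opening `<think>` tag is an assumption (some chat templates open the thinking block in the prompt, so the generated text may contain only the closing tag).

```python
import re

# Hypothetical helper (not from the released code): split a generation into
# its reasoning chain, its raw answer text, and a yes/no label.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_spatialchain_output(text):
    """Return (reasoning, answer_text, label); label is 'yes', 'no', or None."""
    # Everything before </think> is reasoning; the opening tag is optional.
    head, closing_tag, _ = text.partition("</think>")
    reasoning = head.replace("<think>", "").strip() if closing_tag else ""

    answer = _ANSWER_RE.search(text)
    answer_text = answer.group(1).strip() if answer else ""

    # The label is taken from the first word of the answer, lower-cased.
    first_word = re.match(r"[A-Za-z]+", answer_text)
    label = first_word.group(0).lower() if first_word else None
    if label not in ("yes", "no"):
        label = None
    return reasoning, answer_text, label


sample = "<think>\nThe fence is left of the person.\n</think>\n<answer>\nyes\n</answer>"
print(parse_spatialchain_output(sample))
```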

---

## Model Details

| Field | Value |
|-------|-------|
| **Base model** | [Qwen/Qwen3-VL-4B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking) |
| **Adapter type** | LoRA (PEFT) |
| **Training data** | [SpatialChain-Benchmark](https://huggingface.co/datasets/spatialchain/SpatialChain-Benchmark) train split (28,350 examples) |
| **Task** | Binary spatial VQA with chain-of-thought |
| **Language** | English |
| **License** | Apache 2.0 |

---

## Quick Start

```python
import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

base = "Qwen/Qwen3-VL-4B-Thinking"
adapter = "spatialchain/Qwen3-VL-4B-Thinking-SpatialChain"

# Load the processor and the base model, then attach the LoRA adapter.
processor = AutoProcessor.from_pretrained(base, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter)
model.eval()

image = Image.open("your_image.jpg").convert("RGB")

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": (
            "Your task:\n"
            "1. Analyze the image carefully.\n"
            "2. Provide concise reasoning grounded in visible evidence from the image.\n"
            "3. End your response with 'Answer: <one short sentence>'."
        )}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Is there a fence to the left of the person?"},
        ],
    },
]

# Render the chat template to text, then tokenize text and image together.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)

with torch.inference_mode():
    ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
    )

# Decode only the newly generated tokens (skip the prompt).
print(processor.tokenizer.decode(ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

### With 4-bit quantization (lower VRAM)

```python
import torch
from peft import PeftModel
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

base = "Qwen/Qwen3-VL-4B-Thinking"
adapter = "spatialchain/Qwen3-VL-4B-Thinking-SpatialChain"

# NF4 weights with double quantization; compute runs in bfloat16.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForImageTextToText.from_pretrained(
    base, quantization_config=bnb, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter)
```
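
As a rough guide to the saving, weight memory scales with bytes per parameter. The sketch below is a weights-only estimate under stated assumptions: a round 4 billion parameters (taken from the "4B" in the model name), 2 bytes per parameter for bfloat16, and 0.5 bytes for NF4. It ignores quantization constants, activations, the KV cache, and the unquantized adapter weights, so real usage will be higher.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Weights-only memory estimate in GiB."""
    return n_params * bytes_per_param / 1024**3


N_PARAMS = 4e9  # assumption: nominal "4B" from the model name

bf16_gib = weight_memory_gib(N_PARAMS, 2.0)  # bfloat16: 16 bits per weight
nf4_gib = weight_memory_gib(N_PARAMS, 0.5)   # NF4: 4 bits per weight

print(f"bfloat16 weights: ~{bf16_gib:.1f} GiB")
print(f"NF4 weights:      ~{nf4_gib:.1f} GiB")
```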

---

## Training Details

### Dataset

[SpatialChain-Benchmark](https://huggingface.co/datasets/spatialchain/SpatialChain-Benchmark): 28,350 training examples pairing spatially-oriented GQA questions with scene-graph-grounded reasoning chains. Questions cover 11 spatial relation types (`left_of`, `right_of`, `above`, `behind`, `near`, `inside`, …); chains were generated with Claude Haiku 4.5 (extended thinking) and retained only when the generated answer matched the GQA ground truth.

An example training target:

```
<think>
Looking at the image, let me trace through this step-by-step:
(1) Locating the knife - I can see a knife on the left side of the plate.
(2) Finding the bread to the right of the knife - there is a large piece of bread ...
(3) Examining what is to the right of that bread - gray birds are standing on the plate.
(4) Looking for kittens - I do not see any kittens anywhere in the image.
</think>
<answer>
No, there is a bird to the right of the bread.
</answer>
```
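
The retention rule described above (keep a generated chain only when its answer agrees with the GQA ground truth) can be sketched as follows. The record fields `generated_answer` and `gqa_answer`, and the first-word normalisation, are assumptions made for illustration; the dataset's actual build script may differ.

```python
import re


def yes_no_label(answer):
    """Map a free-text answer to 'yes' or 'no' using its first word, else None."""
    match = re.match(r"\s*([A-Za-z]+)", answer)
    word = match.group(1).lower() if match else ""
    return word if word in ("yes", "no") else None


def keep_chain(record):
    """Keep a record only when the generated label equals the ground truth."""
    predicted = yes_no_label(record["generated_answer"])
    truth = record["gqa_answer"].strip().lower()
    return predicted is not None and predicted == truth


# Toy records with hypothetical field names.
records = [
    {"generated_answer": "No, there is a bird to the right of the bread.", "gqa_answer": "no"},
    {"generated_answer": "Yes, the fence is on the left.", "gqa_answer": "no"},
]
kept = [r for r in records if keep_chain(r)]
print(len(kept))
```

Filtering on the final answer alone does not guarantee that every intermediate step is correct; it only removes chains whose conclusion contradicts the label.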
### Hyperparameters

| Hyperparameter | Value |
|----------------|-------|
| Base model | Qwen3-VL-4B-Thinking |
| Quantization | 4-bit NF4 (BitsAndBytes) |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| RSLoRA | ✓ |
| Target modules | all-linear |
| Modules to save | `lm_head`, `embed_tokens` |
| Epochs | 2 |
| Per-device batch size | 4 |
| Gradient accumulation | 3 (effective batch = 12) |
| Learning rate | 3 × 10⁻⁵ |
| LR schedule | cosine |
| Warmup ratio | 0.05 |
| Max sequence length | 32,768 |
| Image max size | 640 px |
| Optimizer | AdamW fused |
| Hardware | 1 × A100 80 GB |
| Training framework | HuggingFace Transformers + PEFT |
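
The batch and schedule rows can be checked with a few lines of arithmetic. The sketch below recomputes the effective batch size and approximates the number of optimizer steps and the learning-rate curve. The step counts are estimates (trainers round partial batches differently), and the linear-warmup, decay-to-zero cosine formula is a common default that is assumed here rather than confirmed by the source.

```python
import math

TRAIN_EXAMPLES = 28_350
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 3
EPOCHS = 2
PEAK_LR = 3e-5
WARMUP_RATIO = 0.05

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM  # single GPU
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed form)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```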

---

## Evaluation

### SpatialChain test set (n = 899)

Evaluation uses two complementary axes. **Axis 1** measures VQA accuracy (exact match after normalisation). **Axis 2** uses a scene-graph-aware LLM judge that scores reasoning faithfulness and completeness independently of the final answer; see the [evaluation code](https://huggingface.co/datasets/spatialchain/SpatialChain-Benchmark) for the full judge protocol.

| Metric | Base (4B) | **This model (4B FT)** |
|--------|-----------|------------------------|
| VQA Accuracy | 78.44% | **82.23%** |
| Macro F1 | 82.01% | **86.67%** |
| Yes-accuracy | 77.74% | 91.34% |
| No-accuracy | 79.64% | 66.57% |
| ROUGE-1 vs. reference chain | 0.403 | **0.657** |
| Token F1 vs. reference chain | 0.392 | **0.646** |
| Reasoning faithfulness (judge) | 0.585 | **0.631** |
| Reasoning completeness (judge) | 0.658 | **0.708** |
| Pass rate | 77.6% | **80.2%** |
| Shortcut rate ↓ | 26.4% | **19.4%** |

**Shortcut rate** = fraction of *correct* answers where the judge scores reasoning faithfulness < 0.5. Lower is better.
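
To make the per-class accuracies and the shortcut rate concrete, the following sketch computes them from per-example records. The field names (`gold`, `pred`, `faithfulness`) are hypothetical and the toy data is invented; only the shortcut-rate definition (correct answers with judge faithfulness below 0.5) comes from the text above.

```python
def summarise(records):
    """Compute accuracy, per-class accuracy, and shortcut rate."""

    def accuracy(rows):
        if not rows:
            return float("nan")
        return sum(r["pred"] == r["gold"] for r in rows) / len(rows)

    yes_rows = [r for r in records if r["gold"] == "yes"]
    no_rows = [r for r in records if r["gold"] == "no"]
    correct = [r for r in records if r["pred"] == r["gold"]]
    # Shortcut: the answer is right but the reasoning was judged unfaithful.
    shortcuts = [r for r in correct if r["faithfulness"] < 0.5]
    return {
        "accuracy": accuracy(records),
        "yes_accuracy": accuracy(yes_rows),
        "no_accuracy": accuracy(no_rows),
        "shortcut_rate": len(shortcuts) / len(correct) if correct else float("nan"),
    }


# Invented toy records, for illustration only.
toy = [
    {"gold": "yes", "pred": "yes", "faithfulness": 0.9},
    {"gold": "yes", "pred": "yes", "faithfulness": 0.3},
    {"gold": "no", "pred": "yes", "faithfulness": 0.2},
    {"gold": "no", "pred": "no", "faithfulness": 0.8},
]
summary = summarise(toy)
print(summary)
```

Reporting `yes_accuracy` and `no_accuracy` side by side exposes class-specific behaviour that the aggregate accuracy hides.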
### External benchmarks

SFT on SpatialChain improves in-domain performance but introduces a **stylistic specialisation effect** on out-of-distribution benchmarks: the model adopts the SpatialChain chain format even when the input distribution differs. Replay-augmented training is recommended to mitigate this.

| Benchmark | Base | Fine-tuned | Δ |
|-----------|------|------------|---|
| SpatialChain test | 78.4% | **82.2%** | +3.8 pp |
| [FlagEval/ERQA](https://huggingface.co/datasets/FlagEval/ERQA) | 45.3% | 38.0% | −7.3 pp |
| [FlagEval/EmbSpatial-Bench](https://huggingface.co/datasets/FlagEval/EmbSpatial-Bench) | 79.1% | 75.7% | −3.4 pp |
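
One way to read the replay recommendation is to interleave a fraction of general-purpose examples with the SpatialChain data during fine-tuning. The sketch below is illustrative only: the function name `mix_with_replay`, the 20% replay ratio, and the toy lists are assumptions, and the source does not specify a replay recipe.

```python
import random


def mix_with_replay(task_examples, replay_pool, replay_ratio=0.2, seed=0):
    """Return task examples plus sampled replay examples, shuffled.

    replay_ratio is the fraction of the final mixture drawn from replay_pool.
    """
    if not 0.0 <= replay_ratio < 1.0:
        raise ValueError("replay_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_task + n_replay) = replay_ratio for n_replay.
    n_replay = round(len(task_examples) * replay_ratio / (1.0 - replay_ratio))
    n_replay = min(n_replay, len(replay_pool))
    mixture = list(task_examples) + rng.sample(list(replay_pool), n_replay)
    rng.shuffle(mixture)
    return mixture


# Toy stand-ins for real datasets (hypothetical).
task = [f"spatial-{i}" for i in range(80)]
general = [f"general-{i}" for i in range(100)]
mixed = mix_with_replay(task, general)
print(len(mixed), sum(x.startswith("general") for x in mixed))
```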

---

## Intended Use

- **Spatial VQA**: binary yes/no questions about object positions and relations in images
- **Reasoning audit**: producing interpretable spatial chains that can be verified against scene structure
- **Research**: studying the relationship between chain-of-thought quality and answer correctness in VLMs

## Out-of-Scope Use

- Tasks requiring metric depth or 3D reasoning (scene graphs are symbolic, not metric)
- Open-ended image captioning or generation
- Non-English inputs

## Bias and Limitations

- **Yes-bias**: the fine-tuned model shows a much larger gap between Yes-accuracy and No-accuracy (24.8 pp, favouring "yes") than the base model (1.9 pp, favouring "no"), consistent with the 58% yes-rate in the training data. Evaluations should report Yes-accuracy and No-accuracy separately.
- **Stylistic specialisation**: the model adopts a fixed reasoning format ("Looking at the image, let me trace through this step-by-step…") on all inputs, which may degrade performance on benchmarks with different prompt styles.
- **GQA domain**: training images are sourced from GQA (Visual Genome); performance on non-natural-image domains is unknown.
- **Projective bias**: 62.7% of training examples involve `left_of` / `right_of` relations; depth-ordered relations (`close`, `far`) are underrepresented.

---

## Citation

```bibtex
@article{spatialchain2026,
  title   = {SpatialChain: A Benchmark for Auditing Spatial Reasoning Faithfulness in VLMs},
  author  = {Anonymous},
  journal = {Under review at NeurIPS 2026},
  year    = {2026}
}
```

---

## Environmental Impact

Training ran for approximately **5 hours** on a single **A100 80 GB** GPU (cloud instance). Carbon emissions can be estimated with the [ML Impact Calculator](https://mlco2.github.io/impact#compute).
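
A back-of-the-envelope version of that estimate multiplies energy use by grid carbon intensity. In the sketch below only the 5-hour duration and the single A100 come from this card. The 400 W board power (the rating of the SXM variant, assumed to be drawn continuously), the PUE of 1.1, and the 0.4 kg CO2e/kWh grid intensity are placeholder assumptions that should be replaced with values for the actual data centre.

```python
def estimate_co2e_kg(hours, gpu_watts, n_gpus=1, pue=1.1, kg_co2e_per_kwh=0.4):
    """Energy (kWh) x PUE x grid intensity; returns kg CO2e."""
    energy_kwh = hours * gpu_watts * n_gpus / 1000.0
    return energy_kwh * pue * kg_co2e_per_kwh


# 5 hours on one GPU; all other inputs are placeholder assumptions.
estimate = estimate_co2e_kg(hours=5, gpu_watts=400)
print(f"~{estimate:.2f} kg CO2e")
```

The result is sensitive to the grid intensity, which varies by roughly an order of magnitude between regions, so the figure is best treated as an order-of-magnitude indication.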