Qwen3-VL-4B-Thinking – SpatialChain LoRA Adapter

A LoRA adapter for Qwen3-VL-4B-Thinking fine-tuned on the SpatialChain-Benchmark dataset. The model learns to produce scene-graph-grounded chain-of-thought reasoning for binary spatial visual questions, structured as:

<think>
[step-by-step spatial reasoning]
</think>
<answer>
yes / no
</answer>

Model Details

| Field | Value |
| --- | --- |
| Base model | Qwen/Qwen3-VL-4B-Thinking |
| Adapter type | LoRA (PEFT) |
| Training data | SpatialChain-Benchmark train split (28,350 examples) |
| Task | Binary spatial VQA with chain-of-thought |
| Language | English |
| License | Apache 2.0 |

Quick Start

from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel
from PIL import Image
import torch

base   = "Qwen/Qwen3-VL-4B-Thinking"
adapter = "spatialchain/Qwen3-VL-4B-Thinking-SpatialChain"

processor = AutoProcessor.from_pretrained(base, trust_remote_code=True)
model     = AutoModelForVision2Seq.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter)
model.eval()

image = Image.open("your_image.jpg").convert("RGB")

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": (
            "Your task:\n"
            "1. Analyze the image carefully.\n"
            "2. Provide concise reasoning grounded in visible evidence from the image.\n"
            "3. End your response with 'Answer: <one short sentence>'."
        )}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text",  "text": "Is there a fence to the left of the person?"},
        ],
    },
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)

with torch.inference_mode():
    ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
    )

# Decode only the newly generated tokens (everything after the prompt).
response = processor.tokenizer.decode(
    ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)
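
The adapter wraps its reasoning in <think>…</think> and its verdict in <answer>…</answer>, so downstream code usually wants the two pieces separately. A minimal parsing sketch; the regex and the yes/no check are illustrative and not part of any released code:

```python
import re

def split_chain(text: str) -> tuple[str, str]:
    """Split a generated response into (reasoning, verdict) strings."""
    # The opening <think> tag is optional: some chat templates insert it into the
    # prompt, so only the closing tag appears in the generation.
    think = re.search(r"(?:<think>)?(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    verdict = answer.group(1).strip() if answer else text.strip()
    return reasoning, verdict

reasoning, verdict = split_chain(response)
is_yes = verdict.lower().startswith("yes")  # binary label for downstream scoring
```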

With 4-bit quantization (lower VRAM)

from transformers import BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForVision2Seq.from_pretrained(
    base, quantization_config=bnb, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter)

Training Details

Dataset

SpatialChain-Benchmark – 28,350 training examples pairing spatially oriented GQA questions with scene-graph-grounded reasoning chains. Questions cover 11 spatial relation types (left_of, right_of, above, behind, near, inside, …); chains were generated with Claude Haiku 4.5 (extended thinking) and retained only when the generated answer matched the GQA ground truth.

The target for each training example looks like:

<think>
Looking at the image, let me trace through this step-by-step:
(1) Locating the knife – I can see a knife on the left side of the plate.
(2) Finding the bread to the right of the knife – there is a large piece of bread ...
(3) Examining what is to the right of that bread – gray birds are standing on the plate.
(4) Looking for kittens – I do not see any kittens anywhere in the image.
</think>
<answer>
No, there is a bird to the right of the bread.
</answer>
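
The answer-match filter described above is straightforward to reproduce. A minimal sketch, assuming hypothetical field names (`generated_answer`, `gt_answer`) rather than the actual dataset schema:

```python
def keep_example(example: dict) -> bool:
    """Keep a generated chain only if its final verdict matches the GQA ground truth."""
    predicted = example["generated_answer"].strip().lower()
    gold = example["gt_answer"].strip().lower()  # "yes" or "no" for binary questions
    return predicted.startswith(gold)

# raw_examples: list of candidate (question, chain, answer) records.
filtered = [ex for ex in raw_examples if keep_example(ex)]
```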

Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Base model | Qwen3-VL-4B-Thinking |
| Quantization | 4-bit NF4 (BitsAndBytes) |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| RSLoRA | ✓ |
| Target modules | all-linear |
| Modules to save | lm_head, embed_tokens |
| Epochs | 2 |
| Per-device batch size | 4 |
| Gradient accumulation | 3 (effective batch = 12) |
| Learning rate | 3 × 10⁻⁵ |
| LR schedule | cosine |
| Warmup ratio | 0.05 |
| Max sequence length | 32,768 |
| Image max size | 640 px |
| Optimizer | AdamW (fused) |
| Hardware | 1 × A100 80 GB |
| Training framework | HuggingFace Transformers + PEFT |
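
For reference, the LoRA settings in the table map onto a PEFT configuration roughly as follows; this is a reconstruction from the table, not the released training script:

```python
from peft import LoraConfig

# Reconstructed from the hyperparameter table above (not the original training script).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    use_rslora=True,
    target_modules="all-linear",
    modules_to_save=["lm_head", "embed_tokens"],
    task_type="CAUSAL_LM",
)
```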

Evaluation

SpatialChain test set (n = 899)

Evaluation uses two complementary axes. Axis 1 measures VQA accuracy (exact match after normalisation). Axis 2 uses a scene-graph-aware LLM judge scoring reasoning faithfulness and completeness independently of the final answer; see the evaluation code for the full judge protocol.
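
Axis 1 reduces each free-form verdict to a yes/no label before comparing against the reference. A minimal sketch of that normalisation (illustrative; the actual evaluation code may differ in its details):

```python
import re

def normalize_verdict(text: str) -> str:
    """Reduce a verdict such as 'Yes, there is a fence.' to 'yes', 'no', or ''."""
    match = re.match(r"(yes|no)\b", text.strip().lower())
    return match.group(1) if match else ""

def exact_match(prediction: str, reference: str) -> bool:
    pred, ref = normalize_verdict(prediction), normalize_verdict(reference)
    return pred != "" and pred == ref
```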

| Metric | Base (4B) | This model (4B FT) |
| --- | --- | --- |
| VQA Accuracy | 78.44% | 82.23% |
| Macro F1 | 82.01% | 86.67% |
| Yes-accuracy | 77.74% | 91.34% |
| No-accuracy | 79.64% | 66.57% |
| ROUGE-1 vs. reference chain | 0.403 | 0.657 |
| Token F1 vs. reference chain | 0.392 | 0.646 |
| Reasoning faithfulness (judge) | 0.585 | 0.631 |
| Reasoning completeness (judge) | 0.658 | 0.708 |
| Pass rate | 77.6% | 80.2% |
| Shortcut rate ↓ | 26.4% | 19.4% |

Shortcut rate = fraction of correct answers where the judge scores reasoning faithfulness < 0.5. Lower is better.
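
Concretely, the shortcut rate can be computed from per-example records as below (a sketch assuming each record carries a boolean `correct` flag and a judge `faithfulness` score in [0, 1]):

```python
def shortcut_rate(records: list[dict]) -> float:
    """Fraction of correctly answered examples whose reasoning the judge
    scores as unfaithful (faithfulness < 0.5). Lower is better."""
    correct = [r for r in records if r["correct"]]
    if not correct:
        return 0.0
    return sum(r["faithfulness"] < 0.5 for r in correct) / len(correct)
```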

External benchmarks

SFT on SpatialChain improves in-domain performance but introduces a stylistic specialisation effect on out-of-distribution benchmarks: the model adopts the SpatialChain chain format even when the input distribution differs. Replay-augmented training is recommended to mitigate this; a minimal sketch follows the table below.

| Benchmark | Base | Fine-tuned | Δ |
| --- | --- | --- | --- |
| SpatialChain test | 78.4% | 82.2% | +3.8 pp |
| FlagEval/ERQA | 45.3% | 38.0% | −7.3 pp |
| FlagEval/EmbSpatial-Bench | 79.1% | 75.7% | −3.4 pp |
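
A replay mix can be as simple as interleaving the SpatialChain training split with a slice of general-purpose VQA or instruction data during SFT. A sketch using `datasets.interleave_datasets`; the dataset identifiers and the 80/20 ratio are illustrative placeholders, not values from this card:

```python
from datasets import interleave_datasets, load_dataset

# Hypothetical identifiers: substitute the actual SpatialChain train split and
# whatever general VQA / instruction-following data you want to replay.
spatialchain = load_dataset("path/to/SpatialChain-Benchmark", split="train")
replay = load_dataset("path/to/general-vqa-sft", split="train")

mixed = interleave_datasets(
    [spatialchain, replay],
    probabilities=[0.8, 0.2],  # keep most of the in-domain signal
    seed=42,
    stopping_strategy="all_exhausted",
)
```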

Intended Use

  • Spatial VQA – binary yes/no questions about object positions and relations in images
  • Reasoning audit – producing interpretable spatial chains that can be verified against scene structure
  • Research – studying the relationship between chain-of-thought quality and answer correctness in VLMs

Out-of-Scope Use

  • Tasks requiring metric depth or 3D reasoning (scene graphs are symbolic, not metric)
  • Open-ended image captioning or generation
  • Non-English inputs

Bias and Limitations

  • Yes-bias – the fine-tuned model shows a much larger yes/no accuracy gap (24.8 pp, versus 1.9 pp for the base model), consistent with the 58% yes rate in the training data. Evaluations should report Yes-accuracy and No-accuracy separately.
  • Stylistic specialisation – the model adopts a fixed reasoning format ("Looking at the image, let me trace through this step-by-step…") on all inputs, which may degrade performance on benchmarks with different prompt styles.
  • GQA domain – training images are sourced from GQA (Visual Genome); performance on non-natural-image domains is unknown.
  • Projective bias – 62.7% of training examples involve left_of / right_of relations; depth-ordered relations (close, far) are underrepresented.

Citation

@article{spatialchain2026,
  title   = {SpatialChain: A Benchmark for Auditing Spatial Reasoning Faithfulness in VLMs},
  author  = {Anonymous},
  journal = {Under review at NeurIPS 2026},
  year    = {2026}
}

Environmental Impact

Training ran for approximately 5 hours on a single A100 80 GB GPU (cloud instance). Carbon emissions can be estimated with the ML Impact Calculator.
