Molmo2-SGCoT: Spatiotemporal Grounded Chain-of-Thought
Paper | Project Page | GitHub | Training Data | Colab
Molmo2-SGCoT is a fine-tuned version of Molmo2-8B that performs Spatiotemporal Grounded Chain-of-Thought (SGCoT) reasoning — generating explicit object tracking trajectories before answering visual questions. Fine-tuned with only 300 synthetic trajectory samples, it achieves over 90% accuracy on VET-Bench, a shell-game tracking benchmark where state-of-the-art VLMs score at random chance (~33%).
Model Description
Standard vision-language models struggle with tasks that require persistent spatial tracking across video frames. Molmo2-SGCoT addresses this by producing a structured <tracks> chain-of-thought that explicitly traces object positions over time before generating a final answer.
Output Format
```
<tracks coords="0.0 1 772 524;0.5 1 805 310;...;12.0 1 216 517">the cup that contains the ball</tracks> Answer: left.
```
Each semicolon-separated entry encodes `t obj x y`:
- `t`: timestamp in seconds (0.0 to 12.0 in 0.5-second steps)
- `obj`: object index (always 1)
- `x`, `y`: spatial coordinates in a [0, 1000] normalized space
- Final answer: one of `left`, `middle`, or `right`
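For downstream use, the `<tracks>` string can be parsed back into numeric trajectory points. A minimal sketch (the helper name `parse_tracks` is our own, not part of the model's API):

```python
import re

def parse_tracks(output: str):
    """Parse a <tracks> block into (timestamp, obj_index, x, y) tuples."""
    match = re.search(r'<tracks coords="([^"]+)"', output)
    if match is None:
        return []
    points = []
    for entry in match.group(1).split(";"):
        t, obj, x, y = entry.split()
        points.append((float(t), int(obj), int(x), int(y)))
    return points

example = '<tracks coords="0.0 1 772 524;0.5 1 805 310">the cup</tracks> Answer: left.'
print(parse_tracks(example))  # [(0.0, 1, 772, 524), (0.5, 1, 805, 310)]
```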
Results on VET-Bench
Random-chance baseline: 33.3%
| Model | Cup | Card | Overall |
|---|---|---|---|
| Gemini 3 Pro Preview | 34.0 | 40.0 | 37.0 |
| Gemini 2.5 Pro | 38.0 | 30.0 | 34.0 |
| Qwen3.5-397B-A17B | 38.0 | 32.0 | 35.0 |
| Molmo2-8B (base) | 30.0 | 38.0 | 34.0 |
| Molmo2-SGCoT (this model) | 92.0 | 90.0 | 91.0 |
Training Details
| Hyperparameter | Value |
|---|---|
| Base model | allenai/Molmo2-8B |
| Method | QLoRA (4-bit NormalFloat) |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| LoRA target modules | att_proj, attn_out, ff_proj, ff_out |
| Training samples | 300 (200 cup + 100 card) |
| Epochs | 1 |
| Batch size | 64 |
| Learning rate | 1e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| Max sequence length | 16,000 |
| Training time | ~3 min on a single A100 |
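The table above can be expressed as a training configuration. A minimal sketch using the `peft` and `transformers` APIs; the values mirror the table, while everything not listed there (e.g. compute dtype) is an assumption:

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NormalFloat quantization for QLoRA (compute dtype is assumed)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter matching the hyperparameters in the table
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["att_proj", "attn_out", "ff_proj", "ff_out"],
    task_type="CAUSAL_LM",
)
```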
Usage
Loading the Model
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo2-8B",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "tiedong/Molmo2-SGCoT")
processor = AutoProcessor.from_pretrained(
    "allenai/Molmo2-8B",
    trust_remote_code=True,
)
```
Inference Example
```python
prompt = "Track the cup that contains the ball and answer which cup contains the ball at the end of the video."

inputs = processor(
    text=prompt,
    videos=["path/to/video.mp4"],
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=2048)
print(processor.tokenizer.decode(output[0], skip_special_tokens=True))
```
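The final `left`/`middle`/`right` answer can then be pulled from the decoded text. A small helper sketch (the function name is ours, and the regex assumes the `Answer: <choice>` format shown above):

```python
import re

def extract_answer(text):
    """Return the final left/middle/right answer from model output, if present."""
    match = re.search(r"Answer:\s*(left|middle|right)", text)
    return match.group(1) if match else None

print(extract_answer('<tracks coords="12.0 1 216 517">the cup</tracks> Answer: left.'))  # left
```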
Training Notebook
The full training pipeline is available in `SGCoT/Molmo2-SGCoT.ipynb`. Training takes only ~3 minutes on a single A100 GPU.
Training Data
The training dataset contains only 300 synthetic samples with structured trajectory annotations:
| Task | Samples | Description |
|---|---|---|
| Cup tracking | 200 | Track which cup hides a ball |
| Card tracking | 100 | Track the Queen of Hearts through a three-card monte |
Trajectories are synthetically generated from movement patterns extracted from real Molmo2 tracking data. Coordinates are encoded in a 1000×1000 normalized space at 0.5-second intervals over 12 seconds.
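To illustrate how a trajectory is serialized into the `<tracks>` coordinate string under this encoding, a minimal encoder sketch (format inferred from the examples above; this is not the authors' generation code):

```python
def encode_tracks(points, label, obj=1):
    """Serialize (x, y) points sampled at 0.5 s intervals into a <tracks> string.

    Coordinates are assumed to already be in the [0, 1000] normalized space.
    """
    coords = ";".join(
        f"{i * 0.5:.1f} {obj} {x} {y}" for i, (x, y) in enumerate(points)
    )
    return f'<tracks coords="{coords}">{label}</tracks>'

s = encode_tracks([(772, 524), (805, 310)], "the cup that contains the ball")
print(s)
# <tracks coords="0.0 1 772 524;0.5 1 805 310">the cup that contains the ball</tracks>
```

A full 12-second clip at 0.5 s intervals yields 25 such points per trajectory.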
Limitations
- Trained specifically on shell-game and card-tracking scenarios; generalization to other tracking tasks has not been evaluated.
Citation
```bibtex
@misc{liu2026visionlanguagemodelssolveshell,
  title={Can Vision-Language Models Solve the Shell Game?},
  author={Tiedong Liu and Wee Sun Lee},
  year={2026},
  eprint={2603.08436},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.08436},
}
```
License
This model is released under the Apache 2.0 License.