Molmo2-SGCoT: Spatiotemporal Grounded Chain-of-Thought
Paper | Project Page | GitHub | Training Data | Colab
Molmo2-SGCoT is a fine-tuned version of Molmo2-8B that performs Spatiotemporal Grounded Chain-of-Thought (SGCoT) reasoning — generating explicit object tracking trajectories before answering visual questions. Fine-tuned with only 300 synthetic trajectory samples, it achieves over 90% accuracy on VET-Bench, a shell-game tracking benchmark where state-of-the-art VLMs score at random chance (~33%).
Model Description
Standard vision-language models struggle with tasks that require persistent spatial tracking across video frames. Molmo2-SGCoT addresses this by producing a structured <tracks> chain-of-thought that explicitly traces object positions over time before generating a final answer.
Output Format
```
<tracks coords="0.0 1 772 524;0.5 1 805 310;...;12.0 1 216 517">the cup that contains the ball</tracks> Answer: left.
```
Each semicolon-separated entry encodes `t obj x y`:
- `t`: timestamp in seconds (0.0 to 12.0 in 0.5-second steps)
- `obj`: object index (always 1)
- `x`, `y`: spatial coordinates in a [0, 1000] normalized space
- Final answer: one of `left`, `middle`, or `right`
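For downstream use, the `<tracks>` string can be parsed back into numeric trajectory points. A minimal sketch (the helper name `parse_tracks` is our own, not part of the model's API):

```python
import re

def parse_tracks(output: str):
    """Parse a <tracks> block into (timestamp, obj_index, x, y) tuples."""
    match = re.search(r'<tracks coords="([^"]+)"', output)
    if match is None:
        return []
    points = []
    for entry in match.group(1).split(";"):
        t, obj, x, y = entry.split()
        points.append((float(t), int(obj), int(x), int(y)))
    return points

example = '<tracks coords="0.0 1 772 524;0.5 1 805 310">the cup</tracks> Answer: left.'
print(parse_tracks(example))  # [(0.0, 1, 772, 524), (0.5, 1, 805, 310)]
```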
Results on VET-Bench
Random-chance baseline: 33.3%
| Model | Cup | Card | Overall |
|---|---|---|---|
| Gemini 3 Pro Preview | 34.0 | 40.0 | 37.0 |
| Gemini 2.5 Pro | 38.0 | 30.0 | 34.0 |
| Qwen3.5-397B-A17B | 38.0 | 32.0 | 35.0 |
| Molmo2-8B (base) | 30.0 | 38.0 | 34.0 |
| Molmo2-SGCoT (this model) | 92.0 | 90.0 | 91.0 |
Training Details
| Hyperparameter | Value |
|---|---|
| Base model | allenai/Molmo2-8B |
| Method | QLoRA (4-bit NormalFloat) |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| LoRA target modules | att_proj, attn_out, ff_proj, ff_out |
| Training samples | 300 (200 cup + 100 card) |
| Epochs | 1 |
| Batch size | 64 |
| Learning rate | 1e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| Max sequence length | 16,000 |
| Training time | ~3 min on a single A100 |
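The table above can be expressed as a training configuration. A minimal sketch using the `peft` and `transformers` APIs; the values mirror the table, while everything not listed there (e.g. compute dtype) is an assumption:

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NormalFloat quantization for QLoRA (compute dtype is assumed)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter matching the hyperparameters in the table
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["att_proj", "attn_out", "ff_proj", "ff_out"],
    task_type="CAUSAL_LM",
)
```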
Usage
Loading the Model
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo2-8B",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "tiedong/Molmo2-SGCoT")
processor = AutoProcessor.from_pretrained(
    "allenai/Molmo2-8B",
    trust_remote_code=True,
)
```
Inference Example
```python
prompt = "Track the cup that contains the ball and answer which cup contains the ball at the end of the video."

inputs = processor(
    text=prompt,
    videos=["path/to/video.mp4"],
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=2048)
print(processor.tokenizer.decode(output[0], skip_special_tokens=True))
```
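The final `left`/`middle`/`right` answer can then be pulled from the decoded text. A small helper sketch (the function name is ours, and the regex assumes the `Answer: <choice>` format shown above):

```python
import re

def extract_answer(text):
    """Return the final left/middle/right answer from model output, if present."""
    match = re.search(r"Answer:\s*(left|middle|right)", text)
    return match.group(1) if match else None

print(extract_answer('<tracks coords="12.0 1 216 517">the cup</tracks> Answer: left.'))  # left
```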
Training Notebook
The full training pipeline is available in `SGCoT/Molmo2-SGCoT.ipynb`. Training takes only ~3 minutes on a single A100 GPU.
Training Data
The training dataset contains only 300 synthetic samples with structured trajectory annotations:
| Task | Samples | Description |
|---|---|---|
| Cup tracking | 200 | Track which cup hides a ball |
| Card tracking | 100 | Track the Queen of Hearts through a three-card monte |
Trajectories are synthetically generated from movement patterns extracted from real Molmo2 tracking data. Coordinates are encoded in a 1000×1000 normalized space at 0.5-second intervals over 12 seconds.
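To illustrate how a trajectory is serialized into the `<tracks>` coordinate string under this encoding, a minimal encoder sketch (format inferred from the examples above; this is not the authors' generation code):

```python
def encode_tracks(points, label, obj=1):
    """Serialize (x, y) points sampled at 0.5 s intervals into a <tracks> string.

    Coordinates are assumed to already be in the [0, 1000] normalized space.
    """
    coords = ";".join(
        f"{i * 0.5:.1f} {obj} {x} {y}" for i, (x, y) in enumerate(points)
    )
    return f'<tracks coords="{coords}">{label}</tracks>'

s = encode_tracks([(772, 524), (805, 310)], "the cup that contains the ball")
print(s)
# <tracks coords="0.0 1 772 524;0.5 1 805 310">the cup that contains the ball</tracks>
```

A full 12-second clip at 0.5 s intervals yields 25 such points per trajectory.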
Limitations
- Trained specifically on shell-game and card-tracking scenarios; generalization to other tracking tasks has not been evaluated.
Citation
```bibtex
@misc{liu2026visionlanguagemodelssolveshell,
  title={Can Vision-Language Models Solve the Shell Game?},
  author={Tiedong Liu and Wee Sun Lee},
  year={2026},
  eprint={2603.08436},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.08436},
}
```
License
This model is released under the Apache 2.0 License.