Molmo2-SGCoT: Spatiotemporal Grounded Chain-of-Thought

Paper | Project Page | GitHub | Training Data | Colab

Molmo2-SGCoT is a fine-tuned version of Molmo2-8B that performs Spatiotemporal Grounded Chain-of-Thought (SGCoT) reasoning — generating explicit object tracking trajectories before answering visual questions. Fine-tuned with only 300 synthetic trajectory samples, it achieves over 90% accuracy on VET-Bench, a shell-game tracking benchmark where state-of-the-art VLMs score at random chance (~33%).

Model Description

Standard vision-language models struggle with tasks that require persistent spatial tracking across video frames. Molmo2-SGCoT addresses this by producing a structured <tracks> chain-of-thought that explicitly traces object positions over time before generating a final answer.

Output Format

<tracks coords="0.0 1 772 524;0.5 1 805 310;...;12.0 1 216 517">the cup that contains the ball</tracks> Answer: left.

Each semicolon-separated waypoint encodes four space-separated fields, t obj x y:
  • t: timestamp in seconds (0.0 to 12.0, in 0.5 s steps)
  • obj: object index (always 1)
  • x, y: spatial coordinates in [0, 1000] normalized space
After the closing </tracks> tag, the model emits the final answer: left, middle, or right.
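The format above can be parsed mechanically. A minimal sketch (the function name parse_tracks is illustrative, not part of the model's API):

```python
import re

def parse_tracks(output: str):
    """Split a Molmo2-SGCoT response into (trajectory, object label, answer).

    Assumes the <tracks coords="...">label</tracks> format described above,
    with each waypoint encoded as "t obj x y" and waypoints separated by ';'.
    """
    m = re.search(
        r'<tracks coords="([^"]*)">(.*?)</tracks>\s*Answer:\s*(\w+)', output
    )
    if m is None:
        raise ValueError("no <tracks> block found in model output")
    coords, label, answer = m.groups()
    trajectory = []
    for point in coords.split(";"):
        t, obj, x, y = point.split()
        trajectory.append((float(t), int(obj), int(x), int(y)))
    return trajectory, label, answer
```
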

Results on VET-Bench

Random-chance baseline: 33.3%

Model                      Cup     Card    Overall
Gemini 3 Pro Preview       34.0    40.0    37.0
Gemini 2.5 Pro             38.0    30.0    34.0
Qwen3.5-397B-A17B          38.0    32.0    35.0
Molmo2-8B (base)           30.0    38.0    34.0
Molmo2-SGCoT (this model)  92.0    90.0    91.0

Training Details

Hyperparameter       Value
Base model           allenai/Molmo2-8B
Method               QLoRA (4-bit NormalFloat)
LoRA rank            16
LoRA alpha           16
LoRA target modules  att_proj, attn_out, ff_proj, ff_out
Training samples     300 (200 cup + 100 card)
Epochs               1
Batch size           64
Learning rate        1e-4
LR scheduler         Cosine
Warmup ratio         0.03
Weight decay         0.01
Max sequence length  16,000
Training time        ~3 min on a single A100
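The table above maps onto a standard peft/bitsandbytes setup. A hypothetical reconstruction (the dropout value and compute dtype are assumptions not stated in the table; the authoritative configuration is in the linked notebook):

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NormalFloat quantization for the frozen base model (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: not listed in the table
)

# LoRA adapter matching the rank, alpha, and target modules above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["att_proj", "attn_out", "ff_proj", "ff_out"],
    lora_dropout=0.0,  # assumption: not listed in the table
    task_type="CAUSAL_LM",
)
```
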

Usage

Loading the Model

from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel

# Load the Molmo2-8B base model, then attach the SGCoT LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo2-8B",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "tiedong/Molmo2-SGCoT")

# The processor handles video preprocessing and tokenization.
processor = AutoProcessor.from_pretrained(
    "allenai/Molmo2-8B",
    trust_remote_code=True,
)

Inference Example

prompt = "Track the cup that contains the ball and answer which cup contains the ball at the end of the video."

inputs = processor(
    text=prompt,
    videos=["path/to/video.mp4"],
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=2048)
# generate() returns the prompt tokens followed by the new tokens;
# decode only the generated continuation.
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))

Training Notebook

Open In Colab

The full training pipeline is available in SGCoT/Molmo2-SGCoT.ipynb. Training takes only ~3 minutes on a single A100 GPU.

Training Data

The training dataset contains only 300 synthetic samples with structured trajectory annotations:

Task           Samples  Description
Cup tracking   200      Track which cup hides a ball
Card tracking  100      Track the Queen of Hearts through a three-card monte

Trajectories are synthetically generated from movement patterns extracted from real Molmo2 tracking data. Coordinates are encoded in a 1000×1000 normalized space at 0.5-second intervals over 12 seconds.
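Mapping the normalized coordinates back to pixels is a simple scaling by the frame size. A minimal sketch (the helper name to_pixels is illustrative):

```python
def to_pixels(x: float, y: float, frame_width: int, frame_height: int):
    """Map a point from the [0, 1000] normalized space used in <tracks>
    coordinates to pixel coordinates in a frame of the given size."""
    return x / 1000.0 * frame_width, y / 1000.0 * frame_height
```

For example, the waypoint (772, 524) in a 1280x720 video lands at roughly (988, 377) pixels.
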

Limitations

  • Trained specifically on shell-game and card-tracking scenarios; generalization to other tracking tasks has not been evaluated.

Citation

@misc{liu2026visionlanguagemodelssolveshell,
      title={Can Vision-Language Models Solve the Shell Game?}, 
      author={Tiedong Liu and Wee Sun Lee},
      year={2026},
      eprint={2603.08436},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.08436}, 
}

License

This model is released under the Apache 2.0 License.
