Molmo2-SGCoT: Spatiotemporal Grounded Chain-of-Thought
Paper | Project Page | GitHub | Training Data | Colab
Molmo2-SGCoT is a fine-tuned version of Molmo2-8B that performs Spatiotemporal Grounded Chain-of-Thought (SGCoT) reasoning — generating explicit object tracking trajectories before answering visual questions. Fine-tuned with only 300 synthetic trajectory samples, it achieves over 90% accuracy on VET-Bench, a shell-game tracking benchmark where state-of-the-art VLMs score at random chance (~33%).
Model Description
Standard vision-language models struggle with tasks that require persistent spatial tracking across video frames. Molmo2-SGCoT addresses this by producing a structured <tracks> chain-of-thought that explicitly traces object positions over time before generating a final answer.
Output Format
```
<tracks coords="0.0 1 772 524;0.5 1 805 310;...;12.0 1 216 517">the cup that contains the ball</tracks>
Answer: left.
```

Each semicolon-separated entry in `coords` has the form `t obj x y`:
- t: timestamp in seconds (0.0 to 12.0, step 0.5s)
- obj: object index (always 1)
- x, y: spatial coordinates in [0, 1000] normalized space
- Final answer: `left`, `middle`, or `right`
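The `coords` attribute is easy to parse programmatically. A minimal sketch (the helper name `parse_tracks` is illustrative, not part of the model's API):

```python
def parse_tracks(coords: str) -> list[tuple[float, int, int, int]]:
    """Parse a <tracks> coords string ("t obj x y;...") into tuples."""
    entries = []
    for entry in coords.split(";"):
        t, obj, x, y = entry.split()
        entries.append((float(t), int(obj), int(x), int(y)))
    return entries

points = parse_tracks("0.0 1 772 524;0.5 1 805 310;12.0 1 216 517")
print(points[0])  # (0.0, 1, 772, 524)
```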
Results on VET-Bench
Random-chance baseline: 33.3%
| Model | Cup | Card | Overall |
|---|---|---|---|
| Gemini 3 Pro Preview | 34.0 | 40.0 | 37.0 |
| Gemini 2.5 Pro | 38.0 | 30.0 | 34.0 |
| Qwen3.5-397B-A17B | 38.0 | 32.0 | 35.0 |
| Molmo2-8B (base) | 30.0 | 38.0 | 34.0 |
| Molmo2-SGCoT (this model) | 92.0 | 90.0 | 91.0 |
Training Details
| Hyperparameter | Value |
|---|---|
| Base model | allenai/Molmo2-8B |
| Method | QLoRA (4-bit NormalFloat) |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| LoRA target modules | att_proj, attn_out, ff_proj, ff_out |
| Training samples | 300 (200 cup + 100 card) |
| Epochs | 1 |
| Batch size | 64 |
| Learning rate | 1e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| Max sequence length | 16,000 |
| Training time | ~3 min on a single A100 |
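The hyperparameters above map directly onto a PEFT `LoraConfig`. A sketch under stated assumptions (the card does not specify LoRA dropout or bias settings; the values below are assumptions):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                  # LoRA rank, from the table above
    lora_alpha=16,         # LoRA alpha, from the table above
    target_modules=["att_proj", "attn_out", "ff_proj", "ff_out"],
    lora_dropout=0.0,      # not stated in the card; 0.0 is an assumption
    bias="none",           # not stated in the card; assumption
)
```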
Usage
Loading the Model
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo2-8B",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "tiedong/Molmo2-SGCoT")

processor = AutoProcessor.from_pretrained(
    "allenai/Molmo2-8B",
    trust_remote_code=True,
)
```
Inference Example
```python
prompt = "Track the cup that contains the ball and answer which cup contains the ball at the end of the video."

inputs = processor(
    text=prompt,
    videos=["path/to/video.mp4"],
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=2048)
print(processor.tokenizer.decode(output[0], skip_special_tokens=True))
```
Training Notebook
The full training pipeline is available in SGCoT/Molmo2-SGCoT.ipynb. Training takes only ~3 minutes on a single A100 GPU.
Training Data
The training dataset contains only 300 synthetic samples with structured trajectory annotations:
| Task | Samples | Description |
|---|---|---|
| Cup tracking | 200 | Track which cup hides a ball |
| Card tracking | 100 | Track the Queen of Hearts through a three-card monte |
Trajectories are synthetically generated from movement patterns extracted from real Molmo2 tracking data. Coordinates are encoded in a 1000×1000 normalized space at 0.5-second intervals over 12 seconds.
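Because coordinates live in a 1000×1000 normalized space, mapping a track point back onto a video frame is a simple rescale. A sketch assuming a hypothetical 1920×1080 frame:

```python
def to_pixels(x: int, y: int, width: int, height: int) -> tuple[float, float]:
    """Map [0, 1000] normalized coordinates to pixel coordinates."""
    return x * width / 1000, y * height / 1000

px, py = to_pixels(772, 524, 1920, 1080)  # ≈ (1482.24, 565.92)
```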
Limitations
- Trained specifically on shell-game and card-tracking scenarios; generalization to other tracking tasks has not been evaluated.
Citation
```bibtex
@misc{liu2026visionlanguagemodelssolveshell,
  title={Can Vision-Language Models Solve the Shell Game?},
  author={Tiedong Liu and Wee Sun Lee},
  year={2026},
  eprint={2603.08436},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.08436},
}
```
License
This model is released under the Apache 2.0 License.