Tempo-6B-Stage2 (Pre-Long-Context SFT)

Final Model: Vision-CAIR/Tempo-6B
GitHub Code: FeiElysia/Tempo
Paper: Small Vision-Language Models are Smart Compressors for Long Video Understanding

This repository contains the Stage 2 intermediate checkpoint for Tempo-6B.

Unlike earlier training stages, this checkpoint is fully capable of direct inference. It has completed short-video and image instruction tuning, making it a strong baseline for general multimodal understanding. However, it has not undergone our final Stage 3 Long-Context SFT.

🚀 When to use this checkpoint?

Custom Fine-Tuning: An ideal starting point if you want to apply your own long-context SFT curriculum or adapt the model to specialized domains.

📊 Ablation Performance

The table below compares this Stage 2 checkpoint with our final Tempo-6B model, evaluated both without and with ATA:

Model Setting	LongVideoBench	MLVU	Video-MME (Overall)	Video-MME (Long)	LVBench
Tempo-6B-Stage2 (w/o ATA)	61.4	67.2	66.1	56.3	47.3
Tempo-6B-Final (w/o ATA)	62.8	73.5	67.0	56.2	51.1
Tempo-6B-Final (w/ ATA)	65.1	75.2	67.7	57.0	52.3
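
To make the ablation easier to read, here is a minimal sketch that computes the per-benchmark difference between two settings. The scores are copied from the table above. The dictionary keys (`stage2`, `final_no_ata`, `final_ata`) and the `deltas` helper are illustrative names chosen for this example and are not part of the Tempo codebase.

```python
# Scores copied from the ablation table above, in column order.
BENCHMARKS = [
    "LongVideoBench",
    "MLVU",
    "Video-MME (Overall)",
    "Video-MME (Long)",
    "LVBench",
]

# Keys are illustrative labels for the three table rows (assumption, not official names).
SCORES = {
    "stage2": [61.4, 67.2, 66.1, 56.3, 47.3],        # Tempo-6B-Stage2 (w/o ATA)
    "final_no_ata": [62.8, 73.5, 67.0, 56.2, 51.1],  # Final model, w/o ATA
    "final_ata": [65.1, 75.2, 67.7, 57.0, 52.3],     # Final model, w/ ATA
}


def deltas(baseline: str, other: str) -> dict[str, float]:
    """Return other minus baseline for each benchmark, rounded to one decimal."""
    return {
        name: round(o - b, 1)
        for name, b, o in zip(BENCHMARKS, SCORES[baseline], SCORES[other])
    }


if __name__ == "__main__":
    for name, gain in deltas("stage2", "final_no_ata").items():
        print(f"{name}: {gain:+.1f}")
```

With these numbers, the gap between Stage 2 and the final model without ATA is largest on MLVU (+6.3) and LVBench (+3.8), while Video-MME (Long) is essentially unchanged (-0.1).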