Tempo-6B-Stage2 (Pre-Long-Context SFT)

This repository contains the Stage 2 intermediate checkpoint for Tempo-6B.

Unlike checkpoints from earlier training stages, this one is fully capable of direct inference. It has completed short-video and image instruction tuning, which makes it a strong baseline for general multimodal understanding. However, it has not undergone our final Stage 3 Long-Context SFT.

🚀 When to use this checkpoint?

  • Custom Fine-Tuning: An ideal starting point if you want to apply your own long-context SFT curriculum or adapt the model to specialized domains.
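As a starting point for custom fine-tuning, the checkpoint files first need to be available locally. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the repository id is the one shown on this model page, while the `download_stage2` helper and the `./tempo-6b-stage2` directory are hypothetical names chosen for illustration. How the weights are then loaded for training depends on the Tempo codebase and is not shown here.

```python
# Repository id as shown on this model page.
REPO_ID = "Vision-CAIR/Tempo-6B-Stage2"


def download_stage2(local_dir="./tempo-6b-stage2"):
    """Download all files of the Stage 2 checkpoint and return the local path.

    `local_dir` is a hypothetical target directory; change it as needed.
    """
    # Imported lazily so this module can be imported without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_stage2()` returns the path of the downloaded snapshot, which can then be passed to whatever training script you use.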

📊 Ablation Performance

To demonstrate its capabilities, here is the performance of this Stage 2 checkpoint compared to our final Tempo-6B model:

| Model Setting | LongVideoBench | MLVU | Video-MME (Overall) | Video-MME (Long) | LVBench |
|---|---|---|---|---|---|
| Tempo-6B-Stage2 (w/o ATA) | 61.4 | 67.2 | 66.1 | 56.3 | 47.3 |
| Tempo-6B-Final (w/o ATA) | 62.8 | 73.5 | 67.0 | 56.2 | 51.1 |
| Tempo-6B Final (w/ ATA) | 65.1 | 75.2 | 67.7 | 57.0 | 52.3 |

Note: For a detailed analysis, please check our paper's Ablation Study A (Progressive Training Curriculum).
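To read the table above as per-stage gains, the row differences can be computed directly. This is a small sketch that uses only the scores reported in the table; the `delta` helper and variable names are illustrative, not part of the Tempo codebase.

```python
# Scores copied from the ablation table above.
BENCHMARKS = [
    "LongVideoBench",
    "MLVU",
    "Video-MME (Overall)",
    "Video-MME (Long)",
    "LVBench",
]

SCORES = {
    "Tempo-6B-Stage2 (w/o ATA)": [61.4, 67.2, 66.1, 56.3, 47.3],
    "Tempo-6B-Final (w/o ATA)": [62.8, 73.5, 67.0, 56.2, 51.1],
    "Tempo-6B Final (w/ ATA)": [65.1, 75.2, 67.7, 57.0, 52.3],
}


def delta(base, other):
    """Per-benchmark difference (other - base), rounded to one decimal."""
    return {
        name: round(o - b, 1)
        for name, b, o in zip(BENCHMARKS, SCORES[base], SCORES[other])
    }


# Change from Stage 2 to the final model, both without ATA.
stage3_gain = delta("Tempo-6B-Stage2 (w/o ATA)", "Tempo-6B-Final (w/o ATA)")
# Additional change from enabling ATA on the final model.
ata_gain = delta("Tempo-6B-Final (w/o ATA)", "Tempo-6B Final (w/ ATA)")

for name in BENCHMARKS:
    print(f"{name:<20} Stage 3: {stage3_gain[name]:+.1f}   ATA: {ata_gain[name]:+.1f}")
```

On these numbers, the largest Stage 3 gain is on MLVU (+6.3), while Video-MME (Long) is essentially unchanged (-0.1) without ATA.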

🔗 Links

For full long-video understanding performance, please use our final weights:

