Tempo
Collection
Official Tempo-6B collection: A query-aware framework solving the mismatch between massive video streams and bounded LLM context windows.
This repository contains the Stage 2 intermediate checkpoint for Tempo-6B.
Unlike earlier training stages, this checkpoint is fully capable of direct inference. It has completed short-video and image instruction tuning, making it a strong baseline for general multimodal understanding. However, it has not undergone our final Stage 3 Long-Context SFT.
To demonstrate its capabilities, here is the performance of this Stage 2 checkpoint compared to our final Tempo-6B model:
| Model Setting | LongVideoBench | MLVU | Video-MME (Overall) | Video-MME (Long) | LVBench |
|---|---|---|---|---|---|
| Tempo-6B-Stage2 (w/o ATA) | 61.4 | 67.2 | 66.1 | 56.3 | 47.3 |
| Tempo-6B-Final (w/o ATA) | 62.8 | 73.5 | 67.0 | 56.2 | 51.1 |
| Tempo-6B-Final (w/ ATA) | 65.1 | 75.2 | 67.7 | 57.0 | 52.3 |
Note: For a detailed analysis, please check our paper's Ablation Study A (Progressive Training Curriculum).
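To make the comparison easier to read, the per-benchmark differences can be computed directly from the table. The snippet below is only an illustrative sketch: the scores are copied from the table above, while the variable names and the `gain` helper are hypothetical and not part of the Tempo codebase.

```python
# Scores copied from the table above; the row keys are illustrative labels.
BENCHMARKS = ["LongVideoBench", "MLVU", "Video-MME (Overall)", "Video-MME (Long)", "LVBench"]

scores = {
    "stage2_no_ata": [61.4, 67.2, 66.1, 56.3, 47.3],
    "final_no_ata": [62.8, 73.5, 67.0, 56.2, 51.1],
    "final_ata": [65.1, 75.2, 67.7, 57.0, 52.3],
}


def gain(base: str, target: str) -> dict:
    """Return target minus base for each benchmark, rounded to one decimal."""
    return {
        name: round(t - b, 1)
        for name, b, t in zip(BENCHMARKS, scores[base], scores[target])
    }


# Difference between the final model and this Stage 2 checkpoint (both w/o ATA).
stage3_gain = gain("stage2_no_ata", "final_no_ata")
# Difference from enabling ATA on the final model.
ata_gain = gain("final_no_ata", "final_ata")

for name in BENCHMARKS:
    print(f"{name}: Stage 3 {stage3_gain[name]:+.1f}, ATA {ata_gain[name]:+.1f}")
```

On these numbers, the largest differences between Stage 2 and the final model (both w/o ATA) are on MLVU and LVBench, while Video-MME (Long) is essentially unchanged.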
For the best long-video understanding performance, please use our final weights: