TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

🌟 Overview

TimeChat-Captioner is a multimodal model designed to generate detailed, time-aware, and structurally coherent captions for multi-scene videos. It effectively coordinates visual and audio information to provide comprehensive video descriptions.

🌐 Project Page: timechat-captioner.github.io
🏠 Model: TimeChat-Captioner (7B)
💻 Code: GitHub Repository
📚 Train Dataset: TimeChatCap-42K
🏆 Benchmark: OmniDCBench

🚀 Quick Start

Below, we provide simple examples to show how to use TimeChat-Captioner-GRPO-7B with 🤗 Transformers.

Installation

conda create -n timechatcap python=3.12
conda activate timechatcap
pip install torch torchvision
pip install transformers==4.57.1
pip install accelerate
pip install flash-attn --no-build-isolation
# It's highly recommended to use `[decord]` feature for faster video loading.
pip install qwen-omni-utils[decord] -U

Usage

Note: To annotate high-quality timestamps and captions, limit video input to around 1 minute. Please segment longer videos into around 60-second clips before processing.

import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# 1. Configuration
MODEL_ID = "yaolily/TimeChat-Captioner-GRPO-7B"
VIDEO_PATH = "example_video.mp4"  # <--- Replace with your video path

MAX_PIXELS = 297920
VIDEO_MAX_PIXELS = 297920


print(f"🚀 Processing video: {VIDEO_PATH}")

# 2. Load Model & Processor
print("⏳ Loading model...")
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
model.disable_talker()

# 3. Construct Conversation
# The prompt encourages detailed, time-aware audio-visual description.
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text", 
                "text": "Thoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
            },
            {
                "type": "video", 
                "video": VIDEO_PATH, 
                "max_pixels": MAX_PIXELS, 
                "max_frames": 160, 
                "fps": 2.0,
                "video_max_pixels": VIDEO_MAX_PIXELS
            }
        ],
    },
]

# 4. Process Inputs
print("⚙️  Processing inputs...")
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)

inputs = processor(
    text=text, 
    audio=audios, 
    images=images, 
    videos=videos, 
    return_tensors="pt", 
    padding=True, 
    use_audio_in_video=True
)
inputs = inputs.to(model.device).to(model.dtype)

# 5. Generate Description
print("✨ Generating description...")
with torch.inference_mode():
    text_ids = model.generate(
        **inputs, 
        use_audio_in_video=True, 
        return_audio=False,
        thinker_max_new_tokens=9216,
        talker_max_tokens=9216
    )

response = processor.decode(text_ids[0][inputs.input_ids[0].size(0):], skip_special_tokens=True)

print("
" + "="*50)
print("🎬 VIDEO DESCRIPTION:")
print("="*50)
print(response)
print("="*50)

📖 Citation

@misc{yao2026timechatcaptioner,
      title={TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions}, 
      author={Linli Yao and Yuancheng Wei and Yaojie Zhang and Lei Li and Xinlong Chen and Feifan Song and Ziyue Wang and Kun Ouyang and Yuanxin Liu and Lingpeng Kong and Qi Liu and Pengfei Wan and Kun Gai and Yuanxing Zhang and Xu Sun},
      year={2026},
      eprint={2602.08711},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.08711}
}

Downloads last month: 2,812

Safetensors

Model size

9B params

Tensor type

BF16

Inference Providers NEW

Video-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for yaolily/TimeChat-Captioner-GRPO-7B

Base model

Qwen/Qwen2.5-Omni-7B

Finetuned

(58)

this model

Paper for yaolily/TimeChat-Captioner-GRPO-7B

TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Paper • 2602.08711 • Published Feb 9 • 29