TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions
Paper
β’
2602.08711
β’
Published
TimeChat-Captioner is a multimodal model designed to generate detailed, time-aware, and structurally coherent captions for multi-scene videos. It effectively coordinates visual and audio information to provide comprehensive video descriptions.
Below, we provide simple examples to show how to use TimeChat-Captioner-GRPO-7B with π€ Transformers.
conda create -n timechatcap python=3.12
conda activate timechatcap
pip install torch torchvision
pip install transformers==4.57.1
pip install accelerate
pip install flash-attn --no-build-isolation
# It's highly recommended to use `[decord]` feature for faster video loading.
pip install qwen-omni-utils[decord] -U
Note: To annotate high-quality timestamps and captions, limit video input to around 1 minute. Please segment longer videos into around 60-second clips before processing.
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
# 1. Configuration
MODEL_ID = "yaolily/TimeChat-Captioner-GRPO-7B"
VIDEO_PATH = "example_video.mp4" # <--- Replace with your video path
MAX_PIXELS = 297920
VIDEO_MAX_PIXELS = 297920
print(f"π Processing video: {VIDEO_PATH}")
# 2. Load Model & Processor
print("β³ Loading model...")
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
MODEL_ID,
torch_dtype=torch.bfloat16,
device_map="cuda",
attn_implementation="flash_attention_2"
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
model.disable_talker()
# 3. Construct Conversation
# The prompt encourages detailed, time-aware audio-visual description.
conversation = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Thoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
},
{
"type": "video",
"video": VIDEO_PATH,
"max_pixels": MAX_PIXELS,
"max_frames": 160,
"fps": 2.0,
"video_max_pixels": VIDEO_MAX_PIXELS
}
],
},
]
# 4. Process Inputs
print("βοΈ Processing inputs...")
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
text=text,
audio=audios,
images=images,
videos=videos,
return_tensors="pt",
padding=True,
use_audio_in_video=True
)
inputs = inputs.to(model.device).to(model.dtype)
# 5. Generate Description
print("β¨ Generating description...")
with torch.inference_mode():
text_ids = model.generate(
**inputs,
use_audio_in_video=True,
return_audio=False,
thinker_max_new_tokens=9216,
talker_max_tokens=9216
)
response = processor.decode(text_ids[0][inputs.input_ids[0].size(0):], skip_special_tokens=True)
print("\n" + "="*50)
print("π¬ VIDEO DESCRIPTION:")
print("="*50)
print(response)
print("="*50)
@misc{yao2026timechatcaptioner,
title={TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions},
author={Linli Yao and Yuancheng Wei and Yaojie Zhang and Lei Li and Xinlong Chen and Feifan Song and Ziyue Wang and Kun Ouyang and Yuanxin Liu and Lingpeng Kong and Qi Liu and Pengfei Wan and Kun Gai and Yuanxing Zhang and Xu Sun},
year={2026},
eprint={2602.08711},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.08711}
}
Base model
Qwen/Qwen2.5-Omni-7B