---
license: other
license_name: bsd-3-clause
license_link: https://github.com/TencentARC/TimeLens/blob/main/LICENSE
language:
- en
tags:
- video-grounding
- temporal-grounding
- video-understanding
- qwen3-vl
library_name: transformers
pipeline_tag: video-text-to-text
datasets:
- TencentARC/TimeLens-100K
- TencentARC/TimeLens-Bench
base_model:
- Qwen/Qwen3-VL-8B-Instruct
---
# TimeLens-8B
[**Paper**](https://arxiv.org/abs/2512.14698) | 💻 [**Code**](https://github.com/TencentARC/TimeLens) | [**Project Page**](https://timelens-arc-lab.github.io/) | 🤗 [**Model & Data**](https://huggingface.co/collections/TencentARC/timelens)
## ✨ Model Description
**TimeLens-8B** is a multimodal large language model (MLLM) fine-tuned from [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) that achieves state-of-the-art video temporal grounding (VTG) performance among open-source models. It is trained with the carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe proposed in our [paper](https://arxiv.org/abs/2512.14698), using our high-quality VTG training dataset [TimeLens-100K](https://huggingface.co/datasets/TencentARC/TimeLens-100K).
## Performance
TimeLens-8B achieves state-of-the-art video temporal grounding performance among open-source models:

| Model | Charades-TimeLens R1@0.3 | Charades-TimeLens R1@0.5 | Charades-TimeLens R1@0.7 | Charades-TimeLens mIoU | ActivityNet-TimeLens R1@0.3 | ActivityNet-TimeLens R1@0.5 | ActivityNet-TimeLens R1@0.7 | ActivityNet-TimeLens mIoU | QVHighlights-TimeLens R1@0.3 | QVHighlights-TimeLens R1@0.5 | QVHighlights-TimeLens R1@0.7 | QVHighlights-TimeLens mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 59.7 | 37.8 | 16.6 | 39.3 | 44.1 | 31.0 | 16.1 | 31.4 | 41.5 | 27.8 | 15.2 | 31.6 |
| TimeLens-7B | 70.5 | 55.6 | 28.4 | 48.8 | 62.8 | 51.0 | 32.6 | 46.2 | 74.1 | 62.7 | 43.1 | 56.0 |
| Qwen3-VL-8B-Instruct | 69.2 | 53.4 | 27.5 | 48.3 | 62.1 | 51.2 | 34.4 | 46.8 | 74.2 | 64.6 | 49.3 | 59.4 |
| TimeLens-8B | 76.6 | 63.0 | 35.2 | 55.2 | 68.9 | 58.4 | 40.6 | 53.2 | 80.2 | 71.6 | 55.5 | 65.5 |

> For a detailed comparison with other models, please refer to the [Leaderboard](https://timelens-arc-lab.github.io/#leaderboard).
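
For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how R1@m and mIoU are commonly computed for temporal grounding. It assumes one predicted span and one ground-truth span per query and counts a hit when IoU is greater than or equal to the threshold; the function names are hypothetical and the official TimeLens-Bench evaluation script may differ in such details.

```python
from typing import Dict, Sequence, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time spans, in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(
    preds: Sequence[Span],
    gts: Sequence[Span],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Return R1@m for each threshold and mIoU, all as percentages.

    Assumption: preds[i] is the single top-1 prediction for query i.
    """
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    metrics = {
        f"R1@{t}": 100.0 * sum(iou >= t for iou in ious) / n for t in thresholds
    }
    metrics["mIoU"] = 100.0 * sum(ious) / n
    return metrics
```

For example, `grounding_metrics([(0, 10), (0, 10)], [(0, 10), (5, 15)])` scores one exact match and one prediction with an IoU of one third, so only the first counts at the 0.5 and 0.7 thresholds.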
## Usage
Install the following packages:
```bash
pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
# quote the requirement so shells such as zsh do not expand the brackets
pip install "qwen-vl-utils[decord]==0.0.14"
# use Flash-Attention 2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
```
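
Flash-Attention 2 is installed above as a speed-up, so on machines where it cannot be built one option is to fall back to PyTorch's scaled dot-product attention. The sketch below only selects the string to pass as `attn_implementation` when loading the model; the helper name `pick_attn_implementation` is hypothetical, and results under `"sdpa"` have not been verified against the numbers reported in this card.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Assumption: the SDPA backend is an acceptable substitute for this model.
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(f"Using attention backend: {attn_implementation}")
```

The returned value can be passed as `attn_implementation=...` to `from_pretrained` in place of a hard-coded backend name.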
Using 🤗 Transformers for inference:
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-8B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-8B",
    padding_side="left",
    do_resize=False,
)

# Prepare input
query = "A man is sitting on a chair"
# Use the /resolve/ URL so the raw video file is fetched, not the HTML viewer page
video_path = "https://huggingface.co/datasets/JungleGym/TimeLens-Assets/resolve/main/2Y8XQ.mp4"
GROUNDER_PROMPT = "Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in - seconds'."
messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 28 * 28,
            'total_pixels': 14336 * 28 * 28,
            'fps': 2,
        },
        {
            'type': 'text',
            'text': GROUNDER_PROMPT.format(query)
        }
    ]
}]

# Build the chat prompt and load the video frames
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos, video_kwargs = process_vision_info(
    messages,
    image_patch_size=16,
    return_video_kwargs=True,
    return_video_metadata=True,
)
# Each entry in `videos` is a (frames, metadata) pair; split them apart
videos, video_metadatas = zip(*videos)
videos, video_metadatas = list(videos), list(video_metadatas)

inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    video_metadata=video_metadatas,
    padding=True,
    return_tensors='pt',
    **video_kwargs,
).to("cuda")

# Generate with greedy decoding
output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)
# Strip the prompt tokens so only the newly generated answer is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"Answer: {answer}")
```
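
The model returns its answer as free text, so downstream code usually needs the start and end times as numbers. The following is a minimal sketch of such a parser, assuming the answer contains two numbers separated by `-` or `to`, as the prompt template requests; `parse_time_span` is a hypothetical helper and not part of the TimeLens codebase.

```python
import re
from typing import Optional, Tuple

# Two non-negative numbers (integer or decimal) separated by "-" or "to".
_SPAN_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)")


def parse_time_span(answer: str) -> Optional[Tuple[float, float]]:
    """Extract (start, end) in seconds from a grounding answer, or None."""
    match = _SPAN_RE.search(answer)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    # Assumption: a reversed pair is a formatting slip, so reorder it.
    if end < start:
        start, end = end, start
    return start, end
```

Calling `parse_time_span(answer)` on the string printed above yields a `(start, end)` tuple when a span is found and `None` otherwise, which lets callers handle malformed generations explicitly.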
## Citation
If you find our work helpful for your research and applications, please cite our paper:
```bibtex
@article{zhang2025timelens,
  title={TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs},
  author={Zhang, Jun and Wang, Teng and Ge, Yuying and Ge, Yixiao and Li, Xinhao and Shan, Ying and Wang, Limin},
  journal={arXiv preprint arXiv:2512.14698},
  year={2025}
}
```