---
license: other
license_name: bsd-3-clause
license_link: https://github.com/TencentARC/TimeLens/blob/main/LICENSE
language:
- en
tags:
- video-grounding
- temporal-grounding
- video-understanding
- qwen2-vl
library_name: transformers
pipeline_tag: video-text-to-text
datasets:
- TencentARC/TimeLens-100K
- TencentARC/TimeLens-Bench
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# TimeLens-7B

📑 [**Paper**](https://arxiv.org/abs/2512.14698) | 💻 [**Code**](https://github.com/TencentARC/TimeLens) | 🏠 [**Project Page**](https://timelens-arc-lab.github.io/) | 🤗 [**Model & Data**](https://huggingface.co/collections/TencentARC/timelens)


## ✨ Model Description

**TimeLens-7B** is a multimodal large language model (MLLM) with strong video temporal grounding (VTG) capability, fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). It is trained with a carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe and an improved timestamp encoding strategy, both proposed in our [paper](https://arxiv.org/abs/2512.14698), using our high-quality VTG training dataset [TimeLens-100K](https://huggingface.co/datasets/TencentARC/TimeLens-100K).

## 📊 Performance

TimeLens-7B outperforms its base model, Qwen2.5-VL-7B-Instruct, on every metric across the three benchmarks below:

<table>
  <thead>
    <tr>
      <th rowspan="2" align="center">Model</th>
      <th colspan="4" align="center">Charades-TimeLens</th>
      <th colspan="4" align="center">ActivityNet-TimeLens</th>
      <th colspan="4" align="center">QVHighlights-TimeLens</th>
    </tr>
    <tr>
      <th align="center">R1<br>@0.3</th>
      <th align="center">R1<br>@0.5</th>
      <th align="center">R1<br>@0.7</th>
      <th align="center">mIoU</th>
      <th align="center">R1<br>@0.3</th>
      <th align="center">R1<br>@0.5</th>
      <th align="center">R1<br>@0.7</th>
      <th align="center">mIoU</th>
      <th align="center">R1<br>@0.3</th>
      <th align="center">R1<br>@0.5</th>
      <th align="center">R1<br>@0.7</th>
      <th align="center">mIoU</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct">Qwen2.5-VL-7B-Instruct</a></td>
      <td align="center">59.7</td>
      <td align="center">37.8</td>
      <td align="center">16.6</td>
      <td align="center">39.3</td>
      <td align="center">44.1</td>
      <td align="center">31.0</td>
      <td align="center">16.1</td>
      <td align="center">31.4</td>
      <td align="center">41.5</td>
      <td align="center">27.8</td>
      <td align="center">15.2</td>
      <td align="center">31.6</td>
    </tr>
    <tr>
      <td><a href="https://huggingface.co/TencentARC/TimeLens-7B"><b>TimeLens-7B</b>🚀</a></td>
      <td align="center"><b>70.5</b></td>
      <td align="center"><b>55.6</b></td>
      <td align="center"><b>28.4</b></td>
      <td align="center"><b>48.8</b></td>
      <td align="center"><b>62.8</b></td>
      <td align="center"><b>51.0</b></td>
      <td align="center"><b>32.6</b></td>
      <td align="center"><b>46.2</b></td>
      <td align="center"><b>74.1</b></td>
      <td align="center"><b>62.7</b></td>
      <td align="center"><b>43.1</b></td>
      <td align="center"><b>56.0</b></td>
    </tr>
    <tr>
      <td><a href="https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct">Qwen3-VL-8B-Instruct</a></td>
      <td align="center">69.2</td>
      <td align="center">53.4</td>
      <td align="center">27.5</td>
      <td align="center">48.3</td>
      <td align="center">62.1</td>
      <td align="center">51.2</td>
      <td align="center">34.4</td>
      <td align="center">46.8</td>
      <td align="center">74.2</td>
      <td align="center">64.6</td>
      <td align="center">49.3</td>
      <td align="center">59.4</td>
    </tr>
    <tr>
      <td><a href="https://huggingface.co/TencentARC/TimeLens-8B"><b>TimeLens-8B</b>🚀</a></td>
      <td align="center"><b>76.6</b></td>
      <td align="center"><b>63.0</b></td>
      <td align="center"><b>35.2</b></td>
      <td align="center"><b>55.2</b></td>
      <td align="center"><b>68.9</b></td>
      <td align="center"><b>58.4</b></td>
      <td align="center"><b>40.6</b></td>
      <td align="center"><b>53.2</b></td>
      <td align="center"><b>80.2</b></td>
      <td align="center"><b>71.6</b></td>
      <td align="center"><b>55.5</b></td>
      <td align="center"><b>65.5</b></td>
    </tr>
  </tbody>
</table>

> For detailed comparison with other models, please refer to the 🏆 [Leaderboard](https://timelens-arc-lab.github.io/#leaderboard).
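
The table reports R1@0.3/0.5/0.7 and mIoU. As a rough illustration of how such numbers are conventionally computed in temporal grounding, here is a minimal sketch. It assumes one predicted `(start, end)` span per query and scores given as percentages; the function names are hypothetical and this is not the official TimeLens evaluation script.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time spans on the same timeline."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: Sequence[Span], gts: Sequence[Span], threshold: float) -> float:
    """Percentage of queries whose single prediction reaches IoU >= threshold."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


def mean_iou(preds: Sequence[Span], gts: Sequence[Span]) -> float:
    """Mean IoU over all queries, as a percentage."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

For example, a prediction of `(0, 10)` against a ground truth of `(5, 15)` overlaps for 5 seconds out of a 15-second union, so it counts as a hit for R1@0.3 but not for R1@0.5.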


## 🚀 Usage

Install the following packages:
```bash
pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
pip install qwen-vl-utils[decord]==0.0.14
# use Flash-Attention 2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
```

Run inference with 🤗 Transformers:
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-7B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-7B",
    padding_side="left",
    do_resize=False,
    trust_remote_code=True,
)

# Prepare input
query = "A man is sitting on a chair"
# Use the "resolve" URL, which serves the raw file ("blob" returns an HTML page)
video_path = "https://huggingface.co/datasets/JungleGym/TimeLens-Assets/resolve/main/2Y8XQ.mp4"

GROUNDER_PROMPT = "You are given a video with multiple frames. The numbers before each video frame indicate its sampling timestamp (in seconds). Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in <start time> - <end time> seconds'."

messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 28 * 28,
            'total_pixels': 14336 * 28 * 28,
            'fps': 2,
        },
        {
            'type': 'text',
            'text': GROUNDER_PROMPT.format(query)
        }
    ]
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages, return_video_metadata=True)

inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    padding=True,
    return_tensors='pt'
).to("cuda")

output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"Answer: {answer}")
```
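
The prompt asks the model to answer in the form `'The event happens in <start time> - <end time> seconds'`, so downstream code usually needs the two numbers rather than the sentence. Below is a minimal sketch of one way to extract them. The helper name `parse_time_span` and the regular expression are assumptions for illustration, not part of the TimeLens codebase, and real outputs may deviate from the requested format.

```python
import re
from typing import Optional, Tuple

# Two numbers separated by "-" or "to", e.g. "12.5 - 17.0" (assumed format)
_SPAN_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)")


def parse_time_span(answer: str) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds, or None if no span is found."""
    match = _SPAN_RE.search(answer)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    # Guard against a reversed span in the model output
    if end < start:
        start, end = end, start
    return start, end
```

Calling `parse_time_span(answer)` on the decoded string from the snippet above would give a tuple of floats, or `None` when the model did not follow the format, which the caller should handle explicitly.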

## Citation

If you find our work helpful for your research and applications, please cite our paper:

```bibtex
@article{zhang2025timelens,
  title={TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs},
  author={Zhang, Jun and Wang, Teng and Ge, Yuying and Ge, Yixiao and Li, Xinhao and Shan, Ying and Wang, Limin},
  journal={arXiv preprint arXiv:2512.14698},
  year={2025}
}
```