---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
---

# DisTime: Distribution-based Time Representation for Video Large Language Models

[Paper](https://huggingface.co/papers/2505.24329) | [GitHub Repository](https://github.com/josephzpng/DisTime)

## Abstract
Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks.

<div align="center">
  <img src="https://github.com/josephzpng/DisTime/raw/main/images/network.png" width="600px"/>
</div>
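To make the idea of decoding a timestamp from a temporal probability distribution more concrete, here is a minimal sketch. It is not taken from the DisTime codebase: the function names `softmax` and `decode_time`, the use of evenly spaced bins over a normalized timeline, and the expectation-based readout are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def decode_time(logits: np.ndarray, duration: float) -> float:
    """Hypothetical readout: turn per-bin logits into one continuous timestamp.

    Assumes the timeline [0, duration] is split into len(logits) equal bins
    and takes the expected bin center under the predicted distribution.
    """
    probs = softmax(logits)
    num_bins = len(probs)
    # Bin centers on a normalized [0, 1] timeline.
    centers = (np.arange(num_bins) + 0.5) / num_bins
    return float((probs * centers).sum() * duration)

# A flat distribution over a 60-second video decodes to the midpoint.
uniform_time = decode_time(np.zeros(10), duration=60.0)

# A distribution concentrated on bin 2 of 10 decodes to that bin's center.
peaked_logits = np.full(10, -1e9)
peaked_logits[2] = 0.0
peaked_time = decode_time(peaked_logits, duration=60.0)
```

Because the output is an expectation over bins rather than an argmax, small shifts in probability mass move the decoded time smoothly, which is one way a distribution can represent an uncertain event boundary.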

## Usage

You can easily load the model using the `transformers` library. The following example demonstrates how to perform inference with DisTime:

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel, AutoProcessor
from decord import cpu, VideoReader

# Load the model and processor
tokenizer = AutoTokenizer.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True)
model = AutoModel.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
processor = AutoProcessor.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True)

model.eval()
video_path = "./examples/video1.mp4" # Replace with your video path
qs = "Describe this video in detail"

# Load video frames
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
frame_indices = np.arange(0, len(vr), max(1, round(fps))) # Sample frames at roughly 1 fps; the step is never below 1
video = [vr[frame_index].asnumpy() for frame_index in frame_indices]
video = np.stack(video)

# Prepare inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video},
            {"type": "text", "text": qs},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
video_inputs = processor.process_video(messages) # Process video frames
inputs = processor(text=[text], videos=video_inputs, padding=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate output
with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        do_sample=False,  # Greedy decoding; temperature has no effect when sampling is off
        max_new_tokens=128,
        use_cache=True,
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
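The frame-sampling step in the example can be isolated and checked without a video file. The sketch below is a standalone illustration: `sample_frame_indices` and `frame_timestamps` are hypothetical helper names, not part of the DisTime repository, and the `target_fps` parameter is an assumption added for flexibility.

```python
import numpy as np

def sample_frame_indices(num_frames: int, native_fps: float, target_fps: float = 1.0) -> np.ndarray:
    """Return frame indices spaced roughly 1 / target_fps seconds apart."""
    if num_frames <= 0:
        return np.empty(0, dtype=np.int64)
    # Guard against a zero step when the native frame rate is very low.
    step = max(1, round(native_fps / target_fps))
    return np.arange(0, num_frames, step, dtype=np.int64)

def frame_timestamps(indices: np.ndarray, native_fps: float) -> np.ndarray:
    """Convert frame indices to timestamps in seconds."""
    return indices / native_fps

# A 4-second clip at 25 fps sampled at 1 fps yields one frame per second.
indices = sample_frame_indices(num_frames=100, native_fps=25.0)
times = frame_timestamps(indices, native_fps=25.0)
```

With `decord`, the returned indices can be passed to `VideoReader.get_batch(indices).asnumpy()` to fetch all sampled frames in one call.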

## Models and Data

### Models
- [DisTime-1B](https://huggingface.co/UserJoseph/DisTime-1B)
- [DisTime-8B](https://huggingface.co/UserJoseph/DisTime-8B)

### InternVid-TG

We build InternVid-TG with the automated annotation paradigm proposed in the paper, which combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. The dataset is released at [https://huggingface.co/datasets/yingsen/internvid-tg](https://huggingface.co/datasets/yingsen/internvid-tg).

<div align="center">
  <img src="https://github.com/josephzpng/DisTime/raw/main/images/internvid-tg.png" width="600px"/>
</div>

## Citation
```bibtex
@article{zeng2025distime,
  title={DisTime: Distribution-based Time Representation for Video Large Language Models},
  author={Zeng, Yingsen and Huang, Zepeng and Zhong, Yujie and Feng, Chengjian and Hu, Jie and Ma, Lin and Liu, Yang},
  journal={arXiv preprint arXiv:2505.24329},
  year={2025}
}
```

## Acknowledgement

DisTime builds on the codebases of [InternVL](https://github.com/OpenGVLab/InternVL) and [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT). We sincerely thank the authors of these open-source projects, which greatly facilitated our research on time representation for video large language models.