---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
---

# DisTime: Distribution-based Time Representation for Video Large Language Models

[Paper](https://huggingface.co/papers/2505.24329) | [GitHub Repository](https://github.com/josephzpng/DisTime)

## Abstract
Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks.

<div align="center">
<img src="https://github.com/josephzpng/DisTime/raw/main/images/network.png" width="600px"/>
</div>
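
To make the idea of decoding time from a probability distribution more concrete, here is a minimal sketch. It is not the DisTime implementation. It assumes time is normalized to [0, 1] and split into a fixed number of bins, and it reads out a timestamp as the expected value of a softmax distribution over those bins. The function name `decode_timestamp`, the bin layout, and the expectation-based readout are all assumptions made for illustration; see the paper and repository for the actual decoder.

```python
import numpy as np


def decode_timestamp(logits, duration):
    """Hypothetical readout: expected time under a distribution over time bins.

    logits:   unnormalized scores, one per time bin (assumed layout)
    duration: video length in seconds
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable softmax turns the scores into a probability distribution
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Bin centers on the normalized [0, 1] time axis
    centers = (np.arange(len(probs)) + 0.5) / len(probs)
    # The expectation yields a continuous value instead of a hard bin choice
    return float(probs @ centers) * duration
```

Because the readout is an expectation, probability mass spread over neighbouring bins moves the decoded time smoothly between bin centers, which is one way a distribution can express an uncertain boundary.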

## Usage

Load the model with the `transformers` library. The example passes `trust_remote_code=True` to load the model's custom code and uses `decord` to read video frames. It runs inference with DisTime-8B on a video sampled at roughly 1 fps:

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel, AutoProcessor
from decord import cpu, VideoReader

# Load the tokenizer, model, and processor
tokenizer = AutoTokenizer.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True)
model = AutoModel.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
processor = AutoProcessor.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True)

model.eval()
video_path = "./examples/video1.mp4"  # Replace with your video path
qs = "Describe this video in detail"

# Load video frames, sampling roughly one frame per second
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
step = max(1, round(fps))  # avoid a zero step for very low frame rates
frame_indices = np.arange(0, len(vr), step)
video = np.stack([vr[int(frame_index)].asnumpy() for frame_index in frame_indices])

# Prepare inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video},
            {"type": "text", "text": qs},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
video_inputs = processor.process_video(messages)  # Process video frames
inputs = processor(text=[text], videos=video_inputs, padding=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate output with greedy decoding
with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=128,
        use_cache=True,
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
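
The frame-sampling step above can be factored into a small helper that also returns the timestamp of each sampled frame, which is useful when comparing predicted time spans against the video. This is a sketch under stated assumptions: `sample_frame_indices`, its `target_fps` parameter, and its `max_frames` cap are hypothetical and not part of the DisTime code. It uses only NumPy.

```python
import numpy as np


def sample_frame_indices(num_frames, fps, target_fps=1.0, max_frames=None):
    """Hypothetical helper: pick frame indices at about `target_fps`.

    Returns (indices, timestamps), where timestamps are in seconds.
    """
    # Number of source frames to skip between samples, never less than 1
    step = max(1, round(fps / target_fps))
    indices = np.arange(0, num_frames, step)
    if max_frames is not None and len(indices) > max_frames:
        # Thin the samples uniformly so long videos stay within the cap
        keep = np.linspace(0, len(indices) - 1, max_frames).round().astype(int)
        indices = indices[keep]
    # Convert each frame index to its time offset in the video
    timestamps = indices / fps
    return indices, timestamps
```

With `decord`, `len(vr)` and `vr.get_avg_fps()` supply `num_frames` and `fps`, and the returned indices can be used in place of `frame_indices`.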

## Models and Data

### Models
- [DisTime-1B](https://huggingface.co/UserJoseph/DisTime-1B)
- [DisTime-8B](https://huggingface.co/UserJoseph/DisTime-8B)

### InternVid-TG

We propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. Using this paradigm, we construct InternVid-TG, a dataset of 1.25M temporally grounded events across 179k videos. The dataset is released at [https://huggingface.co/datasets/yingsen/internvid-tg](https://huggingface.co/datasets/yingsen/internvid-tg).

<div align="center">
<img src="https://github.com/josephzpng/DisTime/raw/main/images/internvid-tg.png" width="600px"/>
</div>

## Citation
```bibtex
@article{zeng2025distime,
  title={DisTime: Distribution-based Time Representation for Video Large Language Models},
  author={Zeng, Yingsen and Huang, Zepeng and Zhong, Yujie and Feng, Chengjian and Hu, Jie and Ma, Lin and Liu, Yang},
  journal={arXiv preprint arXiv:2505.24329},
  year={2025}
}
```

## Acknowledgement

DisTime builds on the codebases of [InternVL](https://github.com/OpenGVLab/InternVL) and [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT). We sincerely thank the authors of these open-source projects, which greatly facilitated our research on time representation for video large language models.