Improve model card: Add metadata, paper link, code link, and usage
This PR improves the model card for the DisTime model by adding metadata and comprehensive usage instructions.
Specifically, this PR:
- Adds the `pipeline_tag: video-text-to-text`, ensuring the model is discoverable in relevant searches on the Hugging Face Hub.
- Adds `library_name: transformers` to indicate compatibility with the Hugging Face `transformers` library, enabling the "How to use" widget on the model page.
- Links the model to its official Hugging Face Paper page: [DisTime: Distribution-based Time Representation for Video Large Language Models](https://huggingface.co/papers/2505.24329).
- Includes a link to the official GitHub repository for easy access to the codebase.
- Provides a detailed "Usage" section with a Python code example demonstrating inference using the `transformers` library.
- Integrates the paper's abstract, key images (network diagram, InternVid-TG dataset examples), models and data information, citation, and acknowledgements from the original GitHub repository for a complete overview.
This improvement aims to make the model more discoverable and user-friendly for the community.
---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
---

# DisTime: Distribution-based Time Representation for Video Large Language Models

[Paper](https://huggingface.co/papers/2505.24329) | [GitHub Repository](https://github.com/josephzpng/DisTime)

## Abstract

Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks.

<div align="center">
<img src="https://github.com/josephzpng/DisTime/raw/main/images/network.png" width="600px"/>
</div>
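
To make the decoder idea concrete, here is a minimal PyTorch sketch of a distribution-based time decoder: a small head maps the hidden state of the learnable time token to a probability distribution over discretized time bins, and a continuous timestamp is read out as the expectation over the bin centers. The module and parameter names here (`DistributionTimeDecoder`, `num_bins`, the hidden size) are illustrative assumptions, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class DistributionTimeDecoder(nn.Module):
    """Illustrative sketch: maps a time-token embedding to a probability
    distribution over discretized, normalized time bins in [0, 1]."""

    def __init__(self, hidden_dim: int = 4096, num_bins: int = 100):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, num_bins),
        )
        # Bin centers spaced uniformly over the normalized video duration.
        self.register_buffer("bin_centers", (torch.arange(num_bins) + 0.5) / num_bins)

    def forward(self, time_token: torch.Tensor):
        # time_token: (batch, hidden_dim), the time token's last hidden state.
        probs = self.head(time_token).softmax(dim=-1)  # (batch, num_bins)
        # Reading the timestamp out as an expectation keeps it continuous and
        # soft, rather than committing to a single hard bin boundary.
        timestamp = (probs * self.bin_centers).sum(dim=-1)  # (batch,), in [0, 1]
        return probs, timestamp

decoder = DistributionTimeDecoder()
probs, t = decoder(torch.randn(2, 4096))
print(t * 60.0)  # e.g., map normalized timestamps into a 60-second video
```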

## Usage

You can load the model with the `transformers` library. The following example samples video frames with `decord` (`pip install decord`) and runs inference with DisTime:

```python
import numpy as np
import torch
from decord import VideoReader, cpu
from transformers import AutoModel, AutoProcessor, AutoTokenizer

# Load the model, tokenizer, and processor
tokenizer = AutoTokenizer.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True)
model = AutoModel.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
processor = AutoProcessor.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True)
model.eval()

video_path = "./examples/video1.mp4"  # Replace with your video path
qs = "Describe this video in detail"

# Load video frames
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
frame_indices = np.array([i for i in range(0, len(vr), round(fps))])  # Sample frames at 1 fps
video = [vr[frame_index].asnumpy() for frame_index in frame_indices]
video = np.stack(video)

# Prepare inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video},
            {"type": "text", "text": qs},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
video_inputs = processor.process_video(messages)  # Process video frames
inputs = processor(text=[text], videos=video_inputs, padding=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate output (greedy decoding; temperature has no effect when do_sample=False)
with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=128,
        use_cache=True,
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
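
The example samples roughly one frame per second; for long videos, consider a larger stride or a cap on the number of sampled frames to keep the visual token count and GPU memory usage manageable.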

## Models and Data

### Models

- [DisTime-1B](https://huggingface.co/UserJoseph/DisTime-1B)
- [DisTime-8B](https://huggingface.co/UserJoseph/DisTime-8B)

### InternVid-TG

We propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models, and use it to construct InternVid-TG, a dataset of 1.25M temporally grounded events across 179k videos. The dataset is released at [https://huggingface.co/datasets/yingsen/internvid-tg](https://huggingface.co/datasets/yingsen/internvid-tg).

<div align="center">
<img src="https://github.com/josephzpng/DisTime/raw/main/images/internvid-tg.png" width="600px"/>
</div>
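
To illustrate the annotation paradigm behind InternVid-TG, the sketch below shows one way captioning and grounding components could be combined. The `grounding_model` and `caption_model` interfaces are hypothetical placeholders for a dedicated temporal localization model and a Video-LLM; this is not the actual pipeline code.

```python
# Hypothetical sketch of the two-stage annotation paradigm; the
# grounding_model / caption_model methods below are placeholder
# interfaces, not a released API.

def annotate_video(video_path: str, grounding_model, caption_model) -> list[dict]:
    # Stage 1: a dedicated temporal model proposes event boundaries.
    segments = grounding_model.detect_events(video_path)  # [(start_s, end_s), ...]

    # Stage 2: a Video-LLM writes a caption for each proposed segment.
    events = []
    for start_s, end_s in segments:
        caption = caption_model.describe(video_path, start=start_s, end=end_s)
        events.append({"start": start_s, "end": end_s, "caption": caption})
    return events
```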

## Citation

```bibtex
@article{zeng2025distime,
  title={DisTime: Distribution-based Time Representation for Video Large Language Models},
  author={Zeng, Yingsen and Huang, Zepeng and Zhong, Yujie and Feng, Chengjian and Hu, Jie and Ma, Lin and Liu, Yang},
  journal={arXiv preprint arXiv:2505.24329},
  year={2025}
}
```

## Acknowledgement

DisTime is built on the codebases of [InternVL](https://github.com/OpenGVLab/InternVL) and [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT). We sincerely thank these open-source projects, which greatly facilitated our research on time representation for video large language models.