Update README.md
README.md
CHANGED
@@ -11,6 +11,11 @@ tags:
 - qwen2-vl
 library_name: transformers
 pipeline_tag: video-text-to-text
+datasets:
+- TencentARC/TimeLens-100K
+- TencentARC/TimeLens-Bench
+base_model:
+- Qwen/Qwen2.5-VL-7B-Instruct
 ---
 
 # TimeLens-7B
@@ -22,7 +27,7 @@ pipeline_tag: video-text-to-text
 
 **TimeLens-7B** is an MLLM with strong video temporal grounding (VTG) capability, fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). It is trained with a carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe and an improved timestamp encoding strategy proposed in our [paper](TODO), utilizing our high-quality VTG training dataset [TimeLens-100K](https://huggingface.co/datasets/TencentARC/TimeLens-100K).
 
-## Performance
+## 📊 Performance
 
 TimeLens-7B achieves strong video temporal grounding performance:
 
@@ -122,7 +127,8 @@ Install the following packages:
 ```bash
 pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
 pip install qwen-vl-utils[decord]==0.0.14
-
+# use Flash-Attention 2 to speed up generation
+pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
 ```
 
 Using 🤗Transformers for Inference:
@@ -201,4 +207,4 @@ If you find our work helpful for your research and applications, please cite our
 
 ```bibtex
 TODO
-```
+```
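The flash-attn package added in the install hunk above only pays off if the model is loaded with Flash-Attention 2 enabled. The README's actual "Using 🤗Transformers for Inference" snippet lies outside the changed hunks, so the sketch below only illustrates the usual Qwen2.5-VL-style video inference flow that section refers to, assuming TimeLens-7B loads with `Qwen2_5_VLForConditionalGeneration`; the repo id `TencentARC/TimeLens-7B`, the video path, and the question are illustrative placeholders, not taken from the model card.

```python
# Hedged sketch: load TimeLens-7B as a standard Qwen2.5-VL checkpoint and ask it
# to temporally ground an event in a video. Repo id, video path, and question
# below are placeholder assumptions.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "TencentARC/TimeLens-7B"  # assumed repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # enabled by the flash-attn install above
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
            {
                "type": "text",
                "text": "When does the person open the door? Give the start and end time.",
            },
        ],
    }
]

# Render the chat template and sample video frames with qwen-vl-utils.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then drop the prompt tokens before decoding the answer.
output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

Since the new `base_model` field points at Qwen/Qwen2.5-VL-7B-Instruct, the processor, chat template, and the `fps` control over video frame sampling should carry over from the base model unchanged.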