TencentARC/TimeLens-7B


JungleGym committed on Jan 13

Commit 1740e26 · verified · 1 parent: 578b1b7

Update README.md

Files changed (1)

README.md +16 -2

@@ -133,10 +133,24 @@ pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
 Using 🤗Transformers for Inference:
 ```python
 import torch
 from transformers import AutoModelForImageTextToText, AutoProcessor
 from qwen_vl_utils import process_vision_info
 # Load model and processor
 model = AutoModelForImageTextToText.from_pretrained(
     "TencentARC/TimeLens-7B",
@@ -153,8 +167,8 @@ processor = AutoProcessor.from_pretrained(
 )
 # Prepare input
-query = "A man is sitting on a chair"
-video_path = "https://huggingface.co/datasets/JungleGym/TimeLens-Assets/blob/main/2Y8XQ.mp4"
 GROUNDER_PROMPT = "You are given a video with multiple frames. The numbers before each video frame indicate its sampling timestamp (in seconds). Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in <start time> - <end time> seconds'."

 Using 🤗Transformers for Inference:
 ```python
+import requests
+import os
 import torch
 from transformers import AutoModelForImageTextToText, AutoProcessor
 from qwen_vl_utils import process_vision_info
+def download_video(url):
+    save_path = os.path.basename(url)
+    if not os.path.exists(save_path):
+        print(f"Downloading video from {url}...")
+        response = requests.get(url, stream=True)
+        response.raise_for_status()
+        with open(save_path, 'wb') as f:
+            for chunk in response.iter_content(chunk_size=8192):
+                f.write(chunk)
+    return save_path
 # Load model and processor
 model = AutoModelForImageTextToText.from_pretrained(
     "TencentARC/TimeLens-7B",
 )
 # Prepare input
+query = "A man drinks water with a glass"
+video_path = download_video("https://huggingface.co/datasets/JungleGym/TimeLens-Assets/resolve/main/2Y8XQ.mp4")
 GROUNDER_PROMPT = "You are given a video with multiple frames. The numbers before each video frame indicate its sampling timestamp (in seconds). Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in <start time> - <end time> seconds'."