Update README.md
README.md (CHANGED)
@@ -17,6 +17,127 @@ For further details, please refer to the following:
- 📚 Github: https://github.com/qunzhongwang/vr-thinker
- 👋 Contact: [Qunzhong Wang](http://qunzhongwang.github.io/)

### Quick Start

We provide a sample test interface below:

~~~python
import json
import os
import random
import warnings

import cv2
import numpy as np
import requests
import torch
import tqdm
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

warnings.filterwarnings("ignore")

# Load the VR-Thinker checkpoint and its processor.
model_path = "qunwang13/vr-thinker"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# Two candidate videos generated from the same prompt.
video_urls = [
    "https://cdn.pixabay.com/video/2024/05/20/212623_large.mp4",  # sample video 1
    "https://cdn.pixabay.com/video/2024/02/07/199320-912042274_large.mp4",  # sample video 2
]

prompt_for_videos = "A cinematic shot of a waterfall in a lush forest."
dim_name_1, dim_explain_1 = "Temporal Alignment (TA)", "How well the video adheres to the temporal aspects of the prompt."
dim_name_2, dim_explain_2 = "Video Quality (VQ)", "The visual and aesthetic quality of the video."
dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."

N = 150  # total frame count the provided frames were downsampled from

prompt_text = f"""Task Description:
Your task is to compare two videos generated based on the same prompt by analyzing their frames in detail and provide an overall judgment along with a judgment for each dimension. This involves:
- Iterative reasoning,
- Zooming in on details,
- Dynamically selecting frames for further analysis.

The provided frames are downsampled from these videos:
- Video 1: First four input frames.
- Video 2: Next four input frames.

The prompt is: {prompt_for_videos}

Evaluation Dimensions:
1. {dim_name_1}:
{dim_explain_1}
2. {dim_name_2}:
{dim_explain_2}
3. {dim_name_3}:
{dim_explain_3}

Frames and Analysis Rules:
- 8 sampled frames are provided, evenly downsampled from {N} frames.
- Insufficient frames? Request more:
<tool_call>{{"target_frames": []}}</tool_call>

Format Requirement:

1. Snapshot:
Every time you receive new visual information, summarize any information that might be useful for your final judgment within <Snapshot></Snapshot> tags.

2. Think:
Place all reasoning content within <Think></Think> tags.

3. Answer:
If the final answer can be determined, output the answer within <Answer></Answer> tags. If the answer is still uncertain, output the recommended answer and confidence level within <Recommend Answer></Recommend Answer> tags.
Here, 1 represents Video 1, 2 represents Video 2, and 0 represents Tie. The confidence levels range from high to low as 1, 2, and 3.

Examples:
<Answer>TA=1, VQ=1, MQ=0, OA=1</Answer>, or
<Recommend Answer>TA=0, VQ=1, MQ=0, OA=1, CF=2</Recommend Answer>
"""

# Both videos (4 frames each) followed by the text prompt in a single user turn.
content_list = [{"type": "video", "video": url, "nframes": 4} for url in video_urls]
content_list.append({"type": "text", "text": prompt_text})

messages = [
    {
        "role": "user",
        "content": content_list,
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
~~~
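
The prompt above also lets the model ask for extra frames via `<tool_call>{"target_frames": [...]}</tool_call>` before committing to a verdict. How those frames should be fetched and fed back is not spelled out in this snippet, so the following is only a rough sketch that continues from the variables defined above; `fetch_frames` is a hypothetical helper, and re-sending the frames as plain images in a new user turn is an assumption, not the repository's actual interface:

~~~python
import json
import re

import cv2
from PIL import Image


def fetch_frames(video_url, frame_indices):
    """Hypothetical helper: grab specific frame indices from a video with OpenCV."""
    cap = cv2.VideoCapture(video_url)
    frames = []
    for idx in frame_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames


tool_call = re.search(r"<tool_call>(.*?)</tool_call>", output_text[0], re.S)
if tool_call:
    target_frames = json.loads(tool_call.group(1))["target_frames"]
    # Which video the indices refer to is not specified here; video 1 is assumed.
    extra_frames = fetch_frames(video_urls[0], target_frames)
    messages.append({"role": "assistant", "content": output_text[0]})
    messages.append({
        "role": "user",
        "content": [{"type": "image", "image": frame} for frame in extra_frames],
    })
    # Re-run apply_chat_template / process_vision_info / generate on the extended
    # messages to get the next turn, and repeat until an <Answer> appears.
~~~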
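
Once the model does commit, the reply should contain an `<Answer>` (or `<Recommend Answer>`) block in the format requested above, e.g. `<Answer>TA=1, VQ=1, MQ=0, OA=1</Answer>`. A minimal way to turn that into per-dimension scores, assuming the reply follows that format (the `parse_judgment` name is ours, not part of the release):

~~~python
import re


def parse_judgment(reply):
    """Pull TA/VQ/MQ/OA (and CF, if present) out of an <Answer> or
    <Recommend Answer> block; returns {} when no such block is found."""
    match = re.search(r"<(Recommend )?Answer>(.*?)</(Recommend )?Answer>", reply, re.S)
    if not match:
        return {}
    return {key: int(val) for key, val in re.findall(r"([A-Z]+)\s*=\s*(\d)", match.group(2))}


print(parse_judgment(output_text[0]))
# e.g. {'TA': 1, 'VQ': 1, 'MQ': 0, 'OA': 1}
~~~
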
## Citation
```