File size: 4,439 Bytes
55500d6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
/share/liangzy/miniconda3/envs/vllm/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
  warnings.warn(
Total Video Size: 31436

  0%|                                                                                      | 0/31436 [00:00<?, ?it/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 31436/31436 [00:00<00:00, 847983.72it/s]
Total Clips Size: 37658
Start: 0, End: 10000
to process size: 10000
Total size: 10000
Sample show: <|im_start|>system
You are an AI assistant tasked with generating **high-quality object recognition questions and answers** based on a given video.

## TASK:
Your task is to generate **one** high-quality **object recognition question** that requires identifying visible objects, people, or immediate scene attributes from the video.

You must also provide **4 answer options (A–D)**, with only one correct answer, which must be directly observable in the visual content.

## INSTRUCTIONS:
- **Focus on Visual Entities**: The question must test the model’s ability to recognize **objects, tools, clothing, colors, or explicit scene elements**.
- **Ground in Visuals**: All answers must be verifiable by pausing a single frame. Avoid actions, motivations, or temporal reasoning.
- **Avoid Abstract Questions**: Only ask about clearly visible physical entities (e.g., 'What is the man holding?' not 'Why is the man holding this?').
- **Natural Language**: Keep questions concise and neutral (e.g., 'What color is the car?' not 'Can you describe the car's color?').
- **Exclusion Questions**: Include at least one 'NOT visible' question per video (e.g., 'Which item is NOT in the scene?').
- **Format**: Format the output as a list of dictionaries with the following keys:
  - `'Q'`: The question.
  - `'Options'`: A list of four answer options labeled 'A', 'B', 'C', and 'D'.
  - `'Answer'`: The correct answer (e.g., `'A'`, `'B'`, etc.).

## EXAMPLES:
1. {'Q': 'What is the woman wearing on her head?',
   'Options': [
       'A. A red hat',
       'B. A black scarf',
       'C. A white helmet',
       'D. No head covering'
   ],
   'Answer': 'A'}

2. {'Q': 'Which object is NOT visible on the table?',
   'Options': [
       'A. A knife',
       'B. A plate',
       'C. A laptop',
       'D. A spoon'
   ],
   'Answer': 'C'}

3. {'Q': 'What type of vehicle is parked in the driveway?',
   'Options': [
       'A. A motorcycle',
       'B. A pickup truck',
       'C. A bicycle',
       'D. A school bus'
   ],
   'Answer': 'B'}

## GUIDELINES FOR CREATING QUESTIONS:
- **Specificity**: Ask about singular, clearly defined objects or attributes (e.g., 'What tool is in the man’s hand?' not 'What tools are visible?').
- **Visual Certainty**: Ensure the correct answer is unambiguous and visible in a single frame.
- **Plausible Distractors**: Wrong options should be visually similar (e.g., other kitchen tools if asking about a pan).
- **No Implicit Knowledge**: Avoid questions requiring domain knowledge (e.g., 'What brand is the car?' is invalid unless the logo is visible).

## OUTPUT FORMAT:
[{'Q': 'Your question here...', 'Options': ['A. ...', 'B. ...', 'C. ...', 'D. ...'], 'Answer': 'Correct answer here...'}]<|im_end|>
<|im_start|>user
I have provided you with a video description generated by Qwen2.5-VL. Based on this description and the system instructions, generate **one** high-quality object recognition question-and-answer pair.

## REQUIREMENTS:
- The question must focus on **identifying visible objects, people, or scene attributes** (e.g., colors, tools, clothing).
- The answer must be directly observable in the description without any reasoning or inference.
- Include at least one 'NOT visible' question type if possible.

## OUTPUT FORMAT:
[{'Q': 'Your question here...', 'Options': ['A. ...', 'B. ...', 'C. ...', 'D. ...'], 'Answer': 'Correct answer here...'}]

**Only return the QA pair in the specified JSON list format.**<|im_end|>
<|im_start|>assistant