| /share/liangzy/miniconda3/envs/vllm/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml | |
| warnings.warn( | |
| Total Video Size: 31436 | |
| 0%| | 0/31436 [00:00<?, ?it/s] 100%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 31436/31436 [00:00<00:00, 847983.72it/s] | |
| Total Clips Size: 37658 | |
| Start: 0, End: 10000 | |
| to process size: 10000 | |
| Total size: 10000 | |
| Sample show: <|im_start|>system | |
| You are an AI assistant tasked with generating **high-quality object recognition questions and answers** based on a given video. | |
| ## TASK: | |
| Your task is to generate **one** high-quality **object recognition question** that requires identifying visible objects, people, or immediate scene attributes from the video. | |
| You must also provide **4 answer options (AβD)**, with only one correct answer, which must be directly observable in the visual content. | |
| ## INSTRUCTIONS: | |
| - **Focus on Visual Entities**: The question must test the modelβs ability to recognize **objects, tools, clothing, colors, or explicit scene elements**. | |
| - **Ground in Visuals**: All answers must be verifiable by pausing a single frame. Avoid actions, motivations, or temporal reasoning. | |
| - **Avoid Abstract Questions**: Only ask about clearly visible physical entities (e.g., 'What is the man holding?' not 'Why is the man holding this?'). | |
| - **Natural Language**: Keep questions concise and neutral (e.g., 'What color is the car?' not 'Can you describe the car's color?'). | |
| - **Exclusion Questions**: Include at least one 'NOT visible' question per video (e.g., 'Which item is NOT in the scene?'). | |
| - **Format**: Format the output as a list of dictionaries with the following keys: | |
| - `'Q'`: The question. | |
| - `'Options'`: A list of four answer options labeled 'A', 'B', 'C', and 'D'. | |
| - `'Answer'`: The correct answer (e.g., `'A'`, `'B'`, etc.). | |
| ## EXAMPLES: | |
| 1. {'Q': 'What is the woman wearing on her head?', | |
| 'Options': [ | |
| 'A. A red hat', | |
| 'B. A black scarf', | |
| 'C. A white helmet', | |
| 'D. No head covering' | |
| ], | |
| 'Answer': 'A'} | |
| 2. {'Q': 'Which object is NOT visible on the table?', | |
| 'Options': [ | |
| 'A. A knife', | |
| 'B. A plate', | |
| 'C. A laptop', | |
| 'D. A spoon' | |
| ], | |
| 'Answer': 'C'} | |
| 3. {'Q': 'What type of vehicle is parked in the driveway?', | |
| 'Options': [ | |
| 'A. A motorcycle', | |
| 'B. A pickup truck', | |
| 'C. A bicycle', | |
| 'D. A school bus' | |
| ], | |
| 'Answer': 'B'} | |
| ## GUIDELINES FOR CREATING QUESTIONS: | |
| - **Specificity**: Ask about singular, clearly defined objects or attributes (e.g., 'What tool is in the manβs hand?' not 'What tools are visible?'). | |
| - **Visual Certainty**: Ensure the correct answer is unambiguous and visible in a single frame. | |
| - **Plausible Distractors**: Wrong options should be visually similar (e.g., other kitchen tools if asking about a pan). | |
| - **No Implicit Knowledge**: Avoid questions requiring domain knowledge (e.g., 'What brand is the car?' is invalid unless the logo is visible). | |
| ## OUTPUT FORMAT: | |
| [{'Q': 'Your question here...', 'Options': ['A. ...', 'B. ...', 'C. ...', 'D. ...'], 'Answer': 'Correct answer here...'}]<|im_end|> | |
| <|im_start|>user | |
| I have provided you with a video description generated by Qwen2.5-VL. Based on this description and the system instructions, generate **one** high-quality object recognition question-and-answer pair. | |
| ## REQUIREMENTS: | |
| - The question must focus on **identifying visible objects, people, or scene attributes** (e.g., colors, tools, clothing). | |
| - The answer must be directly observable in the description without any reasoning or inference. | |
| - Include at least one 'NOT visible' question type if possible. | |
| ## OUTPUT FORMAT: | |
| [{'Q': 'Your question here...', 'Options': ['A. ...', 'B. ...', 'C. ...', 'D. ...'], 'Answer': 'Correct answer here...'}] | |
| **Only return the QA pair in the specified JSON list format.**<|im_end|> | |
| <|im_start|>assistant | |